Introduction

India is home to more than 1.2 billion individuals who come from different regional backgrounds having different mother tongues. There are 1369 rationalized languages and dialects that are spoken across the country as per the data obtained in 2011. In addition to this, as per Statista’s report of 2020, India has also the second-largest (and possibly will have the largest) digital footprint.

Even after such a humongous user base exists in the country, less research has been performed on these languages and the existing multilingual detection systems perform poorly considering the regional languages of India in particular.

Moj - one of India’s largest short-video apps has its roots spread in various regions of the country offering videos in multiple regional languages. While people share their creativity, talent, and experiences on the application, a major issue that has been noticed is that the comments on the application are abusive, offensive, and hateful. It is imperative that since the videos are being made in regional Indian languages, the comments are also written in regional languages and/or are a mixture of two or more languages (multilingual).

The challenge is to develop a solution with the help of Artificial Intelligence to tackle the problem of abusive comments in multilingual Indian languages and detect them.

One important fact to be noticed is that this problem is not only limited to the Moj application. All social media platforms have the issue of abusive comments in multiple languages prevalent. Research is being performed in both areas - identification of multilingual text and detection of abusive and hateful speech contained within the content shared on social media. Unless we are able to detect the hateful content, it is impossible to stop it from spreading around. For making cyberspace a safe space for everyone alike, it is vital that everyone possesses the freedom of expressing their opinions without the fear of people thrashing or abusing them or passing on hateful comments.

How To Use This Repository?

Step 1: Clone this repository

git clone https://github.com/abhi27-web/IndoML-Datathon.git

Step 2: Navigate to the file named Final Submission Code.ipynb

(Refer to the highlighted yellow portion in the image given below)

Step 3: Open IndoML Multilingual Abusive Comment Detection Kaggle Challenge in your browser

Step 4: Navigate to "Code" on the webpage and click on New Notebook

Step 5: Upload the Final Submission Code.ipynb notebook

Step 6: Run the cells of the notebook to get the desired results

Conclusion

In this project we have done the in-depth analysis of multilingual abusive comment detection problem with ML-models and NLP- Model, our investigation has shown the following results:

Models like Logistic Regression, Random Forest, Decision Trees, K-Nearest Neighbors, Gradient Boosting were used with 2 types of vectorization, which were BOW approach and Tf-Idf Approach were used for ML models and BERT was used for NLP model
The best performing models were BERT LR-TF-IDF and RF-TF-IDF with accuracies of 0.84, 0.83 and 0.81 respectively.
The worst performing models were GB-BOW and KN-BOW with accuracies of 0.70 and 0.77 respectively.
Finally we got the best accuracy of 0.93076 using the Logistic Regression Algorithm along with IndicNLP tokenized TFIDF.

Standing In The Competition

Name		Name	Last commit message	Last commit date
Latest commit History 50 Commits
ML Algorithms		ML Algorithms
Modified Data		Modified Data
Data Visualization.ipynb		Data Visualization.ipynb
Final Submission Code.ipynb		Final Submission Code.ipynb
Ml Algorithm-bert-base-multilingual-cased.ipynb		Ml Algorithm-bert-base-multilingual-cased.ipynb
Original Data.zip		Original Data.zip
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ML Algorithms

ML Algorithms

Modified Data

Modified Data

Data Visualization.ipynb

Data Visualization.ipynb

Final Submission Code.ipynb

Final Submission Code.ipynb

Ml Algorithm-bert-base-multilingual-cased.ipynb

Ml Algorithm-bert-base-multilingual-cased.ipynb

Original Data.zip

Original Data.zip

README.md

README.md

Repository files navigation

Introduction

How To Use This Repository?

Conclusion

Standing In The Competition

About

Releases

Packages

Contributors 5

Languages

abhi27-web/IndoML-Datathon

Folders and files

Latest commit

History

Repository files navigation

Introduction

How To Use This Repository?

Conclusion

Standing In The Competition

About

Resources

Stars

Watchers

Forks

Languages