Gender_Identification_From_TWEET

Data Set: https://www.kaggle.com/datasets/crowdflower/twitter-user-gender-classification/code

Summary:

Cleaning Dataset:

• Unnecessary columns ('_unit_id', '_last_judgment_at', 'user_timezone', 'tweet_coord', 'tweet_count', 'tweet_created', 'tweet_id', 'tweet_location', 'profileimage', 'created') were dropped.
• Rows with an unknown or missing gender were removed.
• The profile attributes 'profile_yn', 'profile_yn:confidence' and 'profile_yn_gold' were removed as they were unavailable.
• Rows whose gender label had a confidence below 100% were removed.
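The cleaning steps above can be sketched with pandas roughly as follows. This is a minimal illustration, not the notebook's actual code: the column names 'gender' and 'gender:confidence', the 0 to 1 confidence scale, and the helper name `clean_dataset` are assumptions, and the small frame is made-up data.

```python
import pandas as pd

# Columns the summary lists as dropped or removed.
DROP_COLUMNS = [
    "_unit_id", "_last_judgment_at", "user_timezone", "tweet_coord",
    "tweet_count", "tweet_created", "tweet_id", "tweet_location",
    "profileimage", "created",
    "profile_yn", "profile_yn:confidence", "profile_yn_gold",
]


def clean_dataset(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper applying the cleaning steps to a raw frame."""
    # errors="ignore" skips any listed column that is not present.
    df = df.drop(columns=DROP_COLUMNS, errors="ignore")
    # Remove rows whose gender is missing or labelled "unknown".
    df = df[df["gender"].notna() & (df["gender"] != "unknown")]
    # Keep only fully confident labels (assumes confidence is stored as 0..1).
    df = df[df["gender:confidence"] == 1.0]
    return df.reset_index(drop=True)


# Tiny made-up frame, for illustration only.
raw = pd.DataFrame({
    "_unit_id": [1, 2, 3, 4],
    "gender": ["male", "unknown", None, "female"],
    "gender:confidence": [1.0, 1.0, 1.0, 0.67],
    "text": ["a", "b", "c", "d"],
})
cleaned = clean_dataset(raw)
```

Only the first row survives here: the second and third have no usable gender, and the fourth was labelled with less than full confidence.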

Manipulating Text Data:

• Text was normalized: everything was converted to lower case, and URLs, special characters and double spaces were removed.
• The most common words that carry no meaning in terms of sentiment (stopwords) were removed.
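A minimal sketch of this normalization, using only the standard library. The regular expressions, the function name `normalize`, and the very short stopword set are assumptions for illustration; a real run would likely use a fuller stopword list.

```python
import re

# Small illustrative stopword set (assumption, not the project's list).
STOPWORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "this"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
NON_ALNUM_RE = re.compile(r"[^a-z0-9\s]")
SPACES_RE = re.compile(r"\s+")


def normalize(text: str) -> str:
    """Lower-case, strip URLs and special characters, drop stopwords."""
    text = text.lower()
    text = URL_RE.sub(" ", text)            # remove URLs
    text = NON_ALNUM_RE.sub("", text)       # remove special characters
    text = SPACES_RE.sub(" ", text).strip() # collapse repeated whitespace
    return " ".join(w for w in text.split() if w not in STOPWORDS)


example = normalize("This is the BEST weather update!!  https://t.co/abc123 #sunny")
# example == "best weather update sunny"
```

Because apostrophes count as special characters in this sketch, a word such as "I'm" becomes "im".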

Lemmatization:

• Words expressing the same meaning were reduced to their roots using the Porter stemming algorithm.
• Two tokenizers, a regular one and one that also performs stemming, were used to break the tweets down into individual words.
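The two tokenizers could look roughly like the sketch below. It assumes NLTK is installed and uses its `PorterStemmer`; the regex-based splitting and the function names `tokenize` and `tokenize_and_stem` are illustrative choices, not taken from the notebook.

```python
import re

from nltk.stem import PorterStemmer

TOKEN_RE = re.compile(r"[a-z0-9]+")
stemmer = PorterStemmer()


def tokenize(text: str) -> list[str]:
    """Regular tokenizer: lower-case and split into alphanumeric words."""
    return TOKEN_RE.findall(text.lower())


def tokenize_and_stem(text: str) -> list[str]:
    """Stemming tokenizer: tokenize, then apply the Porter stemmer."""
    return [stemmer.stem(token) for token in tokenize(text)]


plain = tokenize("Running tweets")             # ["running", "tweets"]
stemmed = tokenize_and_stem("Running tweets")  # ["run", "tweet"]
```

Either function can be passed as the `tokenizer` argument of a scikit-learn text vectorizer, which makes it easy to compare models with and without stemming.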

Exploratory Data Analysis:

The answers to the following questions were explored:

• What are the most common emotions/words used by Males and Females? Most common words used by:
  a) Females: im, like, get
  b) Males: like, get, im
  c) Brands: weather, get, updates
• Which are the most frequently used link colours by Males and Females? Most frequently used link colours by:
  a) Males: 0084B4, 009999, 3B94D9
  b) Females: 0084B4, 9266CC, F5ABB5
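Counts like these can be obtained by tallying tokens per label. The sketch below uses `collections.Counter` on a few made-up rows; the function name `top_words_by_group` and the sample data are hypothetical and do not reproduce the figures above.

```python
from collections import Counter, defaultdict


def top_words_by_group(rows, n=3):
    """Return the n most frequent words for each label.

    rows is an iterable of (label, cleaned_text) pairs.
    """
    counters = defaultdict(Counter)
    for label, text in rows:
        counters[label].update(text.split())
    return {
        label: [word for word, _ in counter.most_common(n)]
        for label, counter in counters.items()
    }


# Made-up, already-normalized rows for illustration only.
rows = [
    ("female", "im like coffee"),
    ("female", "im like get"),
    ("female", "im tired"),
    ("brand", "weather updates"),
    ("brand", "weather get updates weather"),
]
top = top_words_by_group(rows, n=2)
# top == {"female": ["im", "like"], "brand": ["weather", "updates"]}
```

The same tally works for link colours if each row carries a colour code instead of tweet text.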

Visualization:

A countplot was created to visualize the number of records for each gender.
A bar plot was created to visualize the number of retweets.
A bar plot was created to visualize the colour attributes.

Classification models with Tweet-text only:

Independent variables: Text, Description. Dependent variable: Gender.

First, the categorical gender labels were converted into numerical ones using LabelEncoder. The data was then split into train and test sets.

• Logistic Regression: accuracy obtained 59.99517141477547 %
• Random Forest: accuracy obtained 56.76001931434089 %
• SVM: accuracy obtained 59.82617093191694 %

Best Accuracy: Logistic Regression

Classification models with content of Description added to text:

To increase the accuracy further, the 'description' was concatenated with 'text' and the training dataset was re-created.

• Logistic Regression: accuracy obtained 68.15548044422984 %
• Random Forest: accuracy obtained 64.48575567358764 %
• SVM: accuracy obtained 68.68662481892805 %
• Naive Bayes: accuracy obtained 69.48575562984 %

Best Accuracy: Naive Bayes

Ensemble Modelling:

An ensemble technique was used to take advantage of all three models.

Accuracy obtained: 69.97633993239981 %
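The pipeline described in this section (label encoding, concatenating 'description' with 'text', a train/test split, and an ensemble) might be sketched with scikit-learn as below. Several choices are assumptions rather than facts from the project: TF-IDF features, hard voting over Logistic Regression, Random Forest and a linear SVM, the 75/25 split, and the made-up rows. The accuracy on such toy data says nothing about the figures reported above.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC

# Made-up rows for illustration only, repeated so every class
# appears in both the training and the test split.
rows = [
    ("watching the game tonight", "sports fan", "male"),
    ("fixed my bike today", "cyclist and coder", "male"),
    ("coffee with friends", "book lover", "female"),
    ("im so happy today", "artist and traveller", "female"),
    ("weather updates every hour", "official weather channel", "brand"),
    ("get our latest deals", "online store news", "brand"),
] * 4
df = pd.DataFrame(rows, columns=["text", "description", "gender"])

# Concatenate the profile description with the tweet text.
features = df["text"] + " " + df["description"]

# Convert the categorical gender labels into integers.
encoder = LabelEncoder()
labels = encoder.fit_transform(df["gender"])

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.25, random_state=0, stratify=labels
)

# Hard-voting ensemble over three classifiers (assumed combination).
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("svm", SVC(kernel="linear")),
    ],
    voting="hard",
)
model = make_pipeline(TfidfVectorizer(), ensemble)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"Ensemble accuracy on the toy data: {accuracy:.2%}")
```

To reproduce the text-only experiment in the same sketch, use `df["text"]` alone as `features`; any single classifier can replace `ensemble` inside `make_pipeline` to compare individual models.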

Conclusion:

The results show that the Tweet text alone yields moderate accuracy (about 57 to 60 %), but adding the content of the Description improves the classification models considerably (about 64 to 69 %). Ensemble Modelling raises the accuracy slightly further, to about 70 %.

divyanshusingh5/Gender_Identification_From_TWEET
