<h1 align=center style="line-height:200%;font-family:vazir;color:#0099cc">
Gender Detection
</font>
</h1>

<h2 align=left style="line-height:200%;font-family:vazir;color:#0099cc">
<font face="vazir" color="#0099cc">
Problem Statement
</font>
</h2>

<p dir=rtl style="direction: ltr; text-align: justify; line-height:200%; font-family:vazir; font-size:medium">
<font face="vazir" size=3>
   Today, social networks have a wide range of uses. Its first use is for entertainment and leisure. But from another perspective, social networks can be used to find behavioral patterns. For example, by analyzing the opinions of social network users, we can find the weaknesses of our business.
    <br>
    Gender is one of the influential parameters in user behavior. When faced with a topic, women will mostly react one way and men will react differently!
    <br>
    Now, in this exercise, we intend to predict the gender of Twitter and Instagram users using the information that Datak has provided us with.
</font>
</p>

<h2 align=left style="line-height:200%;font-family:vazir;color:#0099cc">
<font face="vazir" color="#0099cc">
Importing the required libraries
</font>
</h2>

In [44]:
import numpy as np
import pandas as pd 

<h2 align=left style="line-height:200%;font-family:vazir;color:#0099cc">
<font face="vazir" color="#0099cc">
Introduction to the dataset
</font>
</h2>

<p dir=ltr style="direction: ltr; text-align: justify; line-height:200%; font-family:vazir; font-size:medium">
<font face="vazir" size=3>
The training dataset has 8000 rows and 10 columns. The table below provides more information about the data.
</font>
</p>

<center>
<div dir=ltr style="direction: ltr;line-height:200%;font-family:vazir;font-size:medium">
<font face="vazir" size=3>
    
|column|Description|
|:------:|:---:|
|gender|Gender (target column)|
|age|User age range|
|fullname|The name written on the social network profile|
|username|username|
|biography|User's social network biography|
|follower_count|Number of followers of the user|
|following_count|Number of users that the user follows|
|is_business|The account is a business account.|
|is_verified|The account is a verified account.|
|is_private|The account is a private account.|
    
</font>
</div>
</center>


<p dir=ltr style="direction: ltr; text-align: justify; line-height:200%; font-family:vazir; font-size:medium">
<font face="vazir" size=3>
    The <code>age</code> column is not a continuous variable; it represents age categories. The table below shows how age is mapped.
</font>
</p>


<center>
<div dir=ltr style="direction: ltr;line-height:200%;font-family:vazir;font-size:medium">
<font face="vazir" size=3>
    
|Real age of user|Mapped number|
|:------:|:---:|
|Under 18|1|
|between 19 and 29|2|
|between 30 and 39|3|
|over 40|4|
    
</font>
</div>
</center>


<h2 align=left style="line-height:200%;font-family:vazir;color:#0099cc">
<font face="vazir" color="#0099cc">
Reading the dataset</font>
</h2>

In [39]:
train = pd.read_csv('data/train_data.csv')
test = pd.read_csv('data/test_data.csv')
# train.head()

<h2 align=left style="line-height:200%;font-family:vazir;color:#0099cc">
<font face="vazir" color="#0099cc">
Preprocessing and feature engineering
</font>
</h2>

In [45]:
from sklearn.feature_extraction.text import TfidfVectorizer

train.is_business = train.is_business.fillna(0)


train['text_combined'] = train['fullname'] + ' ' + train['username'] + ' ' + train['biography']


tfidf_vectorizer = TfidfVectorizer(max_features=1000, stop_words='english')


tfidf_matrix = tfidf_vectorizer.fit_transform(train['text_combined'])


tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())


combined_df = pd.concat([train[['age', 'follower_count', 'following_count', 'is_business', 'is_verified', 'is_private']], tfidf_df], axis=1)


In [46]:
test.is_business = test.is_business.fillna(0)


test['text_combined'] = test['fullname'] + ' ' + test['username'] + ' ' + test['biography']

tfidf_vectorizer = TfidfVectorizer(max_features=1000, stop_words='english')

tfidf_matrix = tfidf_vectorizer.fit_transform(test['text_combined'])

tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

combined_df_test = pd.concat([test[['age', 'follower_count', 'following_count', 'is_business', 'is_verified', 'is_private']], tfidf_df], axis=1)


common_columns = list(set(combined_df.columns) & set(combined_df_test.columns))
x_train = combined_df[common_columns]
y_train = train.gender
x_test = combined_df_test[common_columns]



<h2 align=left style="line-height:200%;font-family:vazir;color:#0099cc">
<font face="vazir" color="#0099cc">
Modeling
</font>
</h2>

In [47]:
# modeling :)
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(max_depth=30)
model.fit(x_train,y_train);

<h3 align=left style="line-height:200%;font-family:vazir;color:#0099cc">
<font face="vazir" color="#0099cc">
Evaluation criteria
</font>
</h3>


In [48]:
# evaluate model :)
from sklearn.metrics import f1_score

f1_score(y_train,model.predict(x_train),pos_label='man')


0.8422590068159689

<p dir=ltr style="direction: ltr; text-align: justify; line-height:200%; font-family:vazir; font-size:medium">
<font color="red"><b color='red'>Attention:</b></font>
<font face="vazir" size=3>
To earn full points, your answer needs to score at least <code>75%</code> according to the introduced criteria.
</font>
</p>