#**Project Omega**

#Project's Goal
The main research goal is to set a program that will predict the gender of the author of a tweet and we could extend the model to any text.<br><br>
To do so, we will use some text analytics techniques seen in class. Those techniques are tokenization, BOW (bag of words), TF-IDF and stylometrie.<br><br>
In term of business, with this idea we aim to enhence marketing targeting. Third party advertisement companies can have better knowledge on Twitter’s gender distribution. Thus, they can advise companies that want to make advertisement on social network like Twitter. 


##Importation and Installation of methods
For this project, we will use different packages. The main part will be about text mining. For this, we will use packages seen in class, which are *spacy*, *nltk* and *enchant*. 

In [1]:
import nltk
!pip install spacy
!apt install -qq enchant
!pip install pyenchant
!pip install nltk
!python -m spacy download en
nltk.download('punkt')
nltk.download('wordnet')

The following packages were automatically installed and are no longer required:
  cuda-cufft-10-1 cuda-cufft-dev-10-1 cuda-curand-10-1 cuda-curand-dev-10-1
  cuda-cusolver-10-1 cuda-cusolver-dev-10-1 cuda-cusparse-10-1
  cuda-cusparse-dev-10-1 cuda-drivers cuda-license-10-2 cuda-npp-10-1
  cuda-npp-dev-10-1 cuda-nsight-10-1 cuda-nsight-compute-10-1
  cuda-nsight-systems-10-1 cuda-nvgraph-10-1 cuda-nvgraph-dev-10-1
  cuda-nvjpeg-10-1 cuda-nvjpeg-dev-10-1 cuda-nvrtc-10-1 cuda-nvrtc-dev-10-1
  cuda-nvvp-10-1 default-jre dkms freeglut3 freeglut3-dev
  keyboard-configuration libargon2-0 libcap2 libcryptsetup12 libcublas10
  libdevmapper1.02.1 libfontenc1 libgtk2.0-0 libgtk2.0-common libidn11
  libip4tc0 libjansson4 libnvidia-cfg1-440 libnvidia-common-430
  libnvidia-common-440 libnvidia-decode-440 libnvidia-encode-440
  libnvidia-fbc1-440 libnvidia-gl-440 libnvidia-ifr1-440 libpam-systemd
  libpolkit-agent-1-0 libpolkit-backend-1-0 libpolkit-gobject-1-0 libxfont2
  libxi-dev libxkbfile1 lib

True

In [0]:
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split
import spacy
from spacy.lang.en import English
import enchant
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import accuracy_score

##Data Cleaning <br>
We first began by separate tweet by gender. We made new tables, one for male, one for female, one for brand and one with the unknown gender. <br>
We are not interested by all the feautres (columns), we kept the "gender", the "text" and the "name" for the brand. <br>
We chose to split the data by gender to be able to construct general BOW for each gender.

In [0]:
#Import data
data = pd.read_csv('https://raw.githubusercontent.com/XaviJunior/omega/Data/Data/gender-classifier-DFE-791531.csv',encoding="latin-1")
textgender=data[["gender","text"]]
X=textgender["text"]
y=textgender["gender"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=14)
train= pd.concat([X_train.reset_index(drop='Tru‌​e'),y_train.reset_index(drop='Tru‌​e')],axis=1)
test=pd.concat([X_test.reset_index(drop='Tru‌​e'),y_test.reset_index(drop='Tru‌​e')],axis=1)
#New Table with Male Only and Split in Test/Train Data
Male = train[train['gender'] == 'male']
Maletext=Male["text"]
ym=Male["gender"]
ListMale=Maletext.values.tolist()

#New Table with Female Only and split in Test/Train Data
Female = train[train['gender'] == 'female']
Femaletext=Female["text"]
yf=Female["gender"]
ListFemale=Femaletext.values.tolist()

#New Table with Unknown Only
Unknown=train[train['gender'] == 'unknown']
Unktext=Unknown["text"]
ListUnk=Unktext.values.tolist()

#New Table with Brand Only
Brand = data[data['gender'] == 'brand']
Brandtext=Brand[["text"]]
ListBrand=Brandtext.values.tolist()

###Merging
Now that we create different table for both gender. We will merge all the tweets in two documents, for one male and one for female. We do that because we are interested to know which words are the most used by male and female.

In [3]:
#Merge all Male Text in a single string
TextMale=" "
for i in range(0,len(ListMale)):
  TextMale=TextMale+ListMale[i]
textmale=TextMale.lower()

#Merge all Female Text in a single string
TextFemale=" "
for i in range(0,len(ListFemale)):
  TextFemale=TextFemale+ListFemale[i]
textfemale=TextFemale.lower()
textfemale



In [0]:
def cleaning(input):
  nlp = English()
  Doc=nlp(input)
  Token = []
  for token in Doc:
    Token.append(token.text)
  Words=[]
  d = enchant.Dict("en_US")
  Text=''
  for i in Token:
    if d.check(i)==True:
      Text+=' '+i
  doc=nlp(Text)
  for word in doc:
    if word.is_stop==False and word.is_punct==False:
      Words.append(word)
  words=''
  for j in Words:
    o=str(j)
    lemm=lemmatizer.lemmatize(o)
    words+=lemm+' '
  words=words.lower()
  return(words)

def classifier(clean):
  clean=cleaning(clean)
  y_pred=[]
  text=[Fem,Mal,clean]
  count = CountVectorizer()
  bow = count.fit_transform(text)
  cosine =cosine_similarity(bow[2],bow)
  if cosine[0][0]>=cosine[0][1]:
    y_pred.append("female")
  else:
    y_pred.append("male")
  return y_pred

def bagofwords(clean):
  clean=cleaning(clean)
  text=[Fem,Mal,clean]
  count = CountVectorizer()
  bow = count.fit_transform(text)
  feature_names = count.get_feature_names()
  BagOfWords=pd.DataFrame(bow.todense(),columns=feature_names, index=['Female','Male','Test'])
  return BagOfWords.head()

def removekey(d, key):
  r = dict(d)
  del r[key]
  return r



def jaccard_similarity(list1, list2):
    intersection = len(list(set(list1).intersection(list2)))
    union = (len(list1) + len(list2)) - intersection
    return float(intersection) / union

def jacc_class(clean):
  clean=cleaning(clean)
  y_pred=[]
  simFem=jaccard_similarity(clean,Fem)
  print(simFem)
  simMal=jaccard_similarity(clean,Mal)
  print(simMal)
  if simFem>=simMal:
    y_pred.append('female')
  else:
    y_pred.append('male')
  print(y_pred)
  return y_pred

###Tokenization & Cleaning
Then with those two documents, we have to do some tokenization. The tokenization will allow us to separate each word and then to be able to calulate the frequencie of each single word. The tokenization's technique chosen is the white-space one.<br>
Now that we have tokens, we to clean the lists by removing stop words and other tokens such as "\x89\x9d_\x95ü\x8f\x89\x9d_\x95ü\x8f\x89\x9d_\x95ü\x8f". To do so, we will use a stop words list available in *spacy* and filter words that are not in the english dictionnary available on *pyenchant*. We know that by using an ensglish dictionnary we may lose some expression that are used on twitter. But for us, it was the only sustainable solution to filter non-sense tokens.

In [0]:
Fem=cleaning(textfemale)
Mal=cleaning(textmale)

In [0]:
Fem1=Fem.split(' ')
Mal1=Mal.split(' ')

In [71]:
import collections
import operator
F=collections.Counter(Fem1)
M=collections.Counter(Mal1)
F=removekey(F,'')
M=removekey(M,'')
M = sorted(M.items(), key=operator.itemgetter(1),reverse=True)
F = sorted(F.items(), key=operator.itemgetter(1),reverse=True)
print(F)
print(M)



##Bag Of Words
Here we are going to create a table with a line per gender. The columns are all the words used on the *train tweet* of our dataset. As we merge all the tweet, it would have made no sense to set bag of words of n-gramns other thant one.

In [0]:
#The function used need string as input, we re-transform or list of lemmatizer words in a string.
text=[Fem,Mal]

# using default tokenizer 
count = CountVectorizer()
bow = count.fit_transform(text)

# Get feature names
feature_names = count.get_feature_names()

In [74]:
BagOfWords=pd.DataFrame(bow.todense(),columns=feature_names, index=['Female','Male'])
BagOfWords

Unnamed: 0,00,00 315,00 brand,00 giveaway,00 trimmed,00 use,000,000 31st,000 active,000 closest,000 day,000 diet,000 downloads,000 fan,000 hole,000 like,000 men,000 revenue,000 song,000 stream,000 video,000 view,000 worker,00000000000000002344,00000000000000002344 second,007,007 movie,01,01 05,017,017 track,044,044 sent,05,05 learn,06,06 old,07,07 kinda,07840458711,...,zenith breaking,zero,zero difference,zero fuck,zero know,zero story,zero support,zillion,zillion question,zip,zip file,zipped,zipped piped,zodiac,zodiac 1824,zombie,zombie apocalypse,zombie break,zombie come,zombie crazy,zombie custom,zombie die,zombie dream,zombie game,zombie movie,zombie pal,zone,zone help,zoning,zoning cc,zonked,zonked missed,zoo,zoo best,zoo today,zoo weekend,zoom,zoom best,zoom past,zoom rival
Female,2,0,0,1,0,1,6,1,1,1,0,1,0,0,1,0,0,1,0,0,0,0,0,1,1,0,0,1,1,0,0,1,1,1,1,1,1,0,0,1,...,1,2,1,1,0,0,0,1,1,1,1,1,1,0,0,4,0,0,0,1,0,1,1,0,1,0,0,0,0,0,1,1,1,0,0,1,2,1,1,0
Male,3,1,1,0,1,0,10,0,0,0,1,0,1,1,0,1,1,0,1,1,1,1,1,0,0,1,1,0,0,1,1,0,0,0,0,0,0,1,1,0,...,0,3,0,0,1,1,1,0,0,0,0,0,0,1,1,8,3,1,1,0,1,0,0,1,0,1,1,1,1,1,0,0,2,1,1,0,1,0,0,1


##TF-IDF
We try to apply TF-IDF, but in our case, as we merged all tweets by gender, we only have two documents. The *IDF* part makes no sense in this case.

In [75]:
from sklearn.feature_extraction.text import TfidfVectorizer 
tfidf = TfidfVectorizer(ngram_range=(1, 1))
features = tfidf.fit_transform(text)
TFIDF=pd.DataFrame(features.todense(),columns=tfidf.get_feature_names(),index=['Female','Male'])
TFIDF

Unnamed: 0,00,000,00000000000000002344,007,01,017,044,05,06,07,07840458711,09,10,100,1000,100th,103,105,107,108,1091,10th,11,110,1115,114,116,11th,12,120,120315,121,13,138,13th,14,140,1408,144,15,...,yang,yard,yawn,yea,yeah,year,yell,yelled,yelling,yellow,yep,yes,yesterday,yeti,yield,yo,yoga,yogurt,young,younger,youngest,yous,youth,yr,yummy,yup,zap,zen,zenith,zero,zillion,zip,zipped,zodiac,zombie,zone,zoning,zonked,zoo,zoom
Female,0.001898,0.005693,0.001334,0.0,0.001334,0.0,0.001334,0.001334,0.001334,0.0,0.001334,0.001334,0.03226,0.008539,0.002846,0.001334,0.0,0.0,0.002667,0.001334,0.001334,0.000949,0.007591,0.0,0.0,0.0,0.0,0.001334,0.013283,0.002667,0.0,0.000949,0.003795,0.0,0.000949,0.002846,0.0,0.0,0.001334,0.010437,...,0.0,0.000949,0.0,0.001898,0.028465,0.091086,0.001334,0.000949,0.004001,0.001898,0.005693,0.030362,0.012335,0.001334,0.0,0.006642,0.004001,0.004001,0.003795,0.001898,0.001334,0.001334,0.001898,0.004744,0.003795,0.000949,0.0,0.001334,0.001334,0.001898,0.001334,0.001334,0.001334,0.0,0.003795,0.0,0.0,0.001334,0.000949,0.001898
Male,0.003381,0.01127,0.0,0.001584,0.0,0.001584,0.0,0.0,0.0,0.001584,0.0,0.0,0.036063,0.022539,0.003381,0.0,0.001584,0.001584,0.0,0.0,0.0,0.003381,0.007889,0.003168,0.001584,0.001584,0.001584,0.0,0.01127,0.0,0.001584,0.001127,0.004508,0.003168,0.003381,0.005635,0.004752,0.003168,0.0,0.015777,...,0.001584,0.003381,0.001584,0.002254,0.038317,0.118331,0.0,0.002254,0.0,0.001127,0.005635,0.039444,0.012397,0.0,0.001584,0.019158,0.0,0.0,0.021412,0.002254,0.0,0.0,0.004508,0.007889,0.001127,0.001127,0.001584,0.0,0.0,0.003381,0.0,0.0,0.0,0.001584,0.009016,0.001584,0.001584,0.0,0.002254,0.001127


##Cleaning of the Test Data
This part is a little bit more technical as the first cleaning. In the first case, we merged all tweet so we had only two documents. <br>
Now, we want to clean tweet by tweet and have several clean documents. As we cleaned the *train* set, we need to apply the same method to the *test* set. The goal of this section is to build a list of *cleaned tweets* by gender.

In [0]:
TestSetM = test[test['gender'] == 'male']
TestSetF = test[test['gender'] == 'female']
TestSet=TestSetM.append(TestSetF)
TestSet1=TestSet.sample(frac=1)
X_test=TestSet1["text"]
y_test=TestSet1["gender"]
Test=X_test.values.tolist()
Test_Clean=[]
for i in Test:
  Test_Clean.append(cleaning(i))

Now, we reached our goal of having a list of all the *cleaned* tweet.
We can begin to do some classification. The first method that we will use, is the make the classification according the cosine similarity. We saw in the lectures that this method is often use to do text similarity.<br>


In [0]:
y_pred=[]
for i in Test_Clean:
  y_pred.append(classifier(i))

In [0]:
accuracy_score(y_pred,y_test)

0.5627688697692609

In [0]:
import collections
counter=collections.Counter(y_test)
print(counter)

Counter({'female': 1325, 'male': 1232})


In [0]:
if counter["female"]>counter["male"]:
  BaseRate=counter["female"]/(counter["female"]+counter["male"])
else:
  BaseRate=counter["male"]/(counter["female"]+counter["male"])
print(BaseRate)

0.5181853734845522


In [11]:
Tweet=input("Enter a Tweet: ")
print('This tweet was probably writtent by a:',classifier(Tweet))
bagofwords(Tweet)
jacc_class(Tweet)

Enter a Tweet: Hello, its Beyonce'. World Humanitarian day is 4 days away! August 19th. What will your act of kindness be? #IWASHERE #WHD2012
This tweet was probably writtent by a: ['female']
