# Quotes classification

In this notebook, we process each quote of the corpus and assign it a score, here a binary variable indicating whether the quote is informal (colloquial) or formal, according to a dictionary of informal expressions and slang vocabulary.

In [1]:
#packages
import pandas as pd
from tqdm.notebook import tqdm
#register progress_apply on pandas objects
tqdm.pandas()

### Comparison functions
The following functions are used to flag the quotes that contain an entry of our dictionary of slang/informal words. They simply search each quote for the strings of the dictionary. The first one also updates `dico` with the number of quotes in which each entry was found. This count is used later in the process.

In [2]:
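#function to check if a dictionary entry is contained in the quote and update dico
def isinside1(test_string, test_list):
    #positions of the entries found as substrings of the quote
    #(positions in test_list are used as row labels of dico, which assumes
    #a default integer index and no duplicated words)
    hits = [i for i, ele in enumerate(test_list) if ele in test_string]
    for i in hits:
        #a single .loc call avoids chained assignment on dico
        dico.loc[i, "occurences"] += 1
    return 1 if hits else 0

#same check without updating dico (stops at the first match)
def isinside2(test_string, test_list):
    if any(ext in test_string for ext in test_list):
        return 1
    return 0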
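
As an illustration of the counting logic described above, here is a minimal, self-contained sketch on toy data. The table `toy_dico`, the quotes and the helper `count_and_flag` are hypothetical stand-ins, not the notebook's actual objects; the helper takes the table as an argument instead of relying on a global.

```python
import pandas as pd

# Toy stand-in for the real dictionary (hypothetical entries)
toy_dico = pd.DataFrame({"word": ["gonna", "dude", "shell out"]})
toy_dico["occurences"] = 0
toy_list = toy_dico["word"].tolist()

def count_and_flag(quote, entries, table):
    """Return 1 if any entry is a substring of quote, else 0, and update the counts."""
    hits = [i for i, entry in enumerate(entries) if entry in quote]
    for i in hits:
        # each entry is counted at most once per quote
        table.loc[i, "occurences"] += 1
    return 1 if hits else 0

toy_quotes = [
    "We are gonna win this, dude.",
    "The committee approved the budget.",
    "I'm gonna shell out for a new car.",
]
flags = [count_and_flag(q, toy_list, toy_dico) for q in toy_quotes]
print(flags)      # [1, 0, 1]
print(toy_dico)   # counts: gonna 2, dude 1, shell out 1
```

Note that the count is the number of quotes containing an entry, not the total number of times the entry appears.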

### The dictionary
The dictionary is compiled in the notebook *dictionnary_Merging_Cleaning.ipynb* from three sources:
- [Urban Dictionary](https://www.urbandictionary.com/define.php?term=Urban%20Dictionary)
- [Informal English Vocabulary](https://www.englisch-hilfen.de/en/words/informal2.htm)
- [A Dictionary of English Slang & Colloquialisms](http://www.peevish.co.uk/slang/english-slang/b.htm)

In [3]:
#get dico
dico= pd.read_pickle("./Final_Dictionary.pkl")
dico["occurences"]=0
dico.sample(10)

Unnamed: 0,word,occurences
19773,shell out,0
3015,jank,0
1199,Wah,0
12234,re,0
19814,shit (somone),0
19193,pillow-biter,0
7054,BEDSHITTER,0
5661,Uggdugg,0
10361,lesbians,0
19418,purple-headed warrior,0


We add a column with the length of each expression/word, which could be useful for further statistics. This is also the time to check for duplicates. There should be none, as the dictionary was cleaned in the *Dictionaries_Merging_Cleaning.ipynb* notebook, but you can never be too careful.

In [5]:
#compute string lengths and describe dico
dico["strlen"]= dico["word"].apply(len)
#make a list of the unique dictionary entries
dicolist= dico["word"].unique().tolist()
dico.describe()

Unnamed: 0,occurences,strlen
count,20989.0,20989.0
mean,0.0,8.81052
std,0.0,4.998863
min,0.0,1.0
25%,0.0,6.0
50%,0.0,8.0
75%,0.0,11.0
max,0.0,143.0
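
The duplicate check mentioned above can also be made explicit. The following is a minimal sketch on a toy table (hypothetical entries), assuming the column is named `word` as in `dico`. It counts duplicated entries, then drops them and resets the index so that row labels match list positions.

```python
import pandas as pd

# Toy stand-in for dico with one duplicated entry (hypothetical data)
toy_dico = pd.DataFrame({"word": ["jank", "dip", "jank", "shell out"]})

# Number of rows whose word already appeared earlier in the table
n_duplicates = int(toy_dico["word"].duplicated().sum())
print(n_duplicates)  # 1

# Keep the first occurrence of each word and rebuild a 0..n-1 index
deduplicated = toy_dico.drop_duplicates(subset="word").reset_index(drop=True)
print(deduplicated["word"].tolist())  # ['jank', 'dip', 'shell out']
```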


### Comparison between the dictionary and the Quotebank sample
We import a cleaned sample of about 600'000 quotes extracted from the original Quotebank database. For the data wrangling process, see the notebook *Data Wrangling Quotebank.ipynb* in the DATAWRANGLING folder of the git repository.

In [6]:
#ALEX: C:/Users/alexb/Documents/Ecole/EPFL/MasterII/ADA/Sample_cleaned_1Mio.json.bz2
#NICO: "/Users/nicolasantacroce/Desktop/Desktop/EPFL/EPFL MA1/Applied Data Analysis/Sample.json.bz2"
#importing dataset sample
df= pd.read_json("C:/Users/alexb/Documents/Ecole/EPFL/MasterII/ADA/Sample_cleaned_1Mio.json.bz2",compression="bz2",lines=True)

We classify the quotes using the function `isinside1` described above. This is the most time-consuming step (about 0.03 s per quote).
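
To get a feeling for what this rate means for the whole sample, here is a rough back-of-the-envelope estimate. It assumes the 0.03 s per quote figure holds on average and uses the row count reported for the sample in this notebook (668534).

```python
n_quotes = 668534          # number of quotes in the sample
seconds_per_quote = 0.03   # rough rate quoted in the text (an approximation)

total_seconds = n_quotes * seconds_per_quote
total_hours = total_seconds / 3600
print(f"{total_hours:.1f} hours")  # 5.6 hours
```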

In [7]:
#separating quotes (slow; to test on a subset first, use the commented line below instead of df2 = df)
#df2= df.loc[0:70000]
df2 = df
df2["colloquial"]= df2["quotation"].progress_apply(lambda x : isinside1(x,dicolist))

  0%|          | 0/668534 [00:00<?, ?it/s]

Sanity check: every column has a count of 668534, so no field is missing, and the mean of `colloquial` (0.999957) shows that more than 99.99% of the quotes are categorized as colloquial.

In [8]:
df2.describe()

Unnamed: 0,numOccurrences,p1,p2,delta_p,year,colloquial
count,668534.0,668534.0,668534.0,668534.0,668534.0,668534.0
mean,3.55898,0.818245,0.161096,0.657149,2017.536327,0.999957
std,22.64603,0.095738,0.081709,0.173361,1.770882,0.006586
min,1.0,0.5001,0.0086,0.3,2015.0,0.0
25%,1.0,0.7474,0.0934,0.5219,2016.0,1.0
50%,1.0,0.8301,0.1526,0.6749,2018.0,1.0
75%,2.0,0.8973,0.2219,0.8022,2019.0,1.0
max,12086.0,0.9908,0.35,0.9821,2020.0,1.0


The fact that almost all quotes are categorized as colloquial is not surprising given the size of the dictionary used (>20'000 entries). It includes some common words that are not colloquial in every use, and because the search is a plain substring test, short entries such as `re` also match inside longer words. This is why we will focus on the less frequently occurring terms, which give a clearer distinction between formal and informal language.
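
The effect of substring matching can be seen on a single quote. The sketch below compares the `in` test used by the functions above with a whole-word test based on `\b` word boundaries. The whole-word variant is only an illustrative alternative, not the method used in this notebook, and the quote is made up.

```python
import re

quote = "They are there already."   # hypothetical quote
entry = "re"                        # a short entry present in the dictionary sample

# Substring test, as in the notebook: matches inside "are", "there", "already"
substring_hit = entry in quote

# Whole-word test (illustrative alternative): entry must stand alone as a word
whole_word_hit = re.search(r"\b" + re.escape(entry) + r"\b", quote) is not None

print(substring_hit, whole_word_hit)  # True False
```

A word-boundary test has its own limits: for entries that start or end with punctuation, such as a closing parenthesis, `\b` does not behave like a simple "end of entry" marker, so this variant would need adapting before being applied to the full dictionary.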

### Dictionary reduction
The threshold is set at 0.001. This means that an expression must appear in fewer than one quote out of a thousand to remain in the dictionary.
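
In absolute terms, the threshold can be worked out from the number of quotes in the sample (668534). The short sketch below applies the same strict comparison as the filter to a few example counts; the counts 668 and 669 are chosen for illustration, while 666 and 639634 are the maxima reported in the tables of this notebook.

```python
n_quotes = 668534   # number of quotes in the sample
tresh = 0.001       # threshold on the fraction of quotes containing an entry

# 0.001 * 668534 = 668.534, so an entry is kept if it occurs in at most 668 quotes
cutoff = tresh * n_quotes

example_counts = [666, 668, 669, 639634]
kept = [c for c in example_counts if c / n_quotes < tresh]
print(kept)  # [666, 668]
```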

In [9]:
#fraction of quotes containing each word (words at or above the 0.1% threshold are removed in the next cell)
tresh= 0.001
dico["occurence_fraction"]= dico["occurences"]/df2["colloquial"].count()
dico.describe()

Unnamed: 0,occurences,strlen,occurence_fraction
count,20989.0,20989.0,20989.0
mean,816.276716,8.81052,0.001221
std,12365.917242,4.998863,0.018497
min,0.0,1.0,0.0
25%,0.0,6.0,0.0
50%,0.0,8.0,0.0
75%,2.0,11.0,3e-06
max,639634.0,143.0,0.956771


In [10]:
#creating new dictionary without the most frequently occurring words
dico2= dico[dico["occurence_fraction"]<tresh]
dicolist2= dico2["word"].unique().tolist()
dico2.describe()

Unnamed: 0,occurences,strlen,occurence_fraction
count,20116.0,20116.0,20116.0
mean,16.672947,9.011981,2.5e-05
std,68.375757,4.997072,0.000102
min,0.0,1.0,0.0
25%,0.0,6.0,0.0
50%,0.0,8.0,0.0
75%,1.0,11.0,1e-06
max,666.0,143.0,0.000996


We use the same approach as before to search the quotes, this time with the reduced dictionary and with `isinside2`, which does not update the counts.

In [11]:
#re-evaluating quotes with reduced dictionary
df2["colloquial"]= df2["quotation"].progress_apply(lambda x : isinside2(x,dicolist2))

  0%|          | 0/668534 [00:00<?, ?it/s]

In [12]:
df2.describe()
df2.head()

Unnamed: 0,quoteID,quotation,speaker,qids,date,numOccurrences,probas,urls,phase,p1,p2,delta_p,year,colloquial
0,2015-11-11-109291,They'll call me lots of different things. Libe...,Chris Christie,[Q63879],2015-11-11 00:55:12,1,"[[Chris Christie, 0.7395], [Bobby Jindal, 0.15...",[http://thehill.com/blogs/ballot-box/259760-ch...,E,0.7395,0.1505,0.589,2015,0
1,2015-09-11-070666,It's kind of the same way it's been with the R...,Niklas Kronwall,[Q722939],2015-09-11 19:54:00,1,"[[Niklas Kronwall, 0.7119], [None, 0.2067], [H...",[http://redwings.nhl.com/club/news.htm?id=7787...,E,0.7119,0.2067,0.5052,2015,0
2,2015-11-09-033345,I had a chuckle: They were showing a video of ...,Kris Draper,[Q948695],2015-11-09 00:57:45,3,"[[Kris Draper, 0.8782], [None, 0.1043], [Serge...",[http://ca.rd.yahoo.com/sports/rss/nfl/SIG=13u...,E,0.8782,0.1043,0.7739,2015,1
3,2015-09-05-038628,New Zealand will go in with a lot of confidenc...,John Eales,[Q926351],2015-09-05 02:40:10,3,"[[John Eales, 0.7896], [None, 0.2006], [Toutai...",[http://www.stuff.co.nz/sport/rugby/all-blacks...,E,0.7896,0.2006,0.589,2015,0
4,2015-02-11-042325,In his suicide note he even made a joke thanki...,Pat Buckley,"[Q19956564, Q23006312, Q7143252, Q7143253]",2015-02-11 09:59:09,1,"[[Pat Buckley, 0.8816], [None, 0.1184]]",[http://independent.ie/life/health-wellbeing/m...,E,0.8816,0.1184,0.7632,2015,1


Now we can display the reduced dictionary sorted by the `occurence_fraction` column. This gives a first overview of the kind of words that are present in the sample without being too frequent (here, in fewer than 1 quote in 1000). Single words are in the majority, but some multi-word expressions also appear near the top of this ranking.

In [13]:
dico2 = dico2.sort_values(by=['occurence_fraction'], ascending=False)
dico2.head(60)

Unnamed: 0,word,occurences,strlen,occurence_fraction
14693,Stu,666,3,0.000996
15642,email,666,5,0.000996
2975,htf,666,3,0.000996
1027,dish,664,4,0.000993
57,warn,663,4,0.000992
20299,suit,662,5,0.00099
3810,boat,661,4,0.000989
19591,ruck,660,5,0.000987
2133,dip,660,3,0.000987
18698,mits,659,4,0.000986
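
To quantify the share of multi-word expressions rather than reading it off the ranking, one option is to flag the entries that contain a space. This is a minimal sketch on a toy table built from a few entries shown in this notebook; on the real data the same two lines would be applied to `dico2`.

```python
import pandas as pd

# Toy stand-in for dico2 (entries taken from the samples shown in this notebook)
toy = pd.DataFrame({"word": ["email", "dish", "shell out", "purple-headed warrior"]})

# An entry is treated as multi-word if it contains at least one space
toy["is_multiword"] = toy["word"].str.contains(" ", regex=False)
share = toy["is_multiword"].mean()
print(share)  # 0.5
```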


Finally, we can export the data to proceed to the next steps of the project, which consist of:

- trying to learn more about the data using the sample exported below
- preparing adequate notebooks to highlight the insights of the data once we run the whole process on the entire data frame
- getting more hints about the story we could tell

In [None]:
#export as bz2-compressed JSON Lines (one record per line)
df2.to_json("C:/Users/alexb/Documents/Ecole/EPFL/MasterII/ADA/Sample_classified_1Mio_v1.json.bz2",compression="bz2",lines=True,orient="records")
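
Before moving on, it can be worth checking that the exported file reads back correctly. The sketch below does a round trip on a toy data frame in a temporary directory, with the same `to_json` options as the export. The file name and the toy rows are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the classified sample (hypothetical rows)
toy = pd.DataFrame({"quotation": ["a quote", "another quote"], "colloquial": [0, 1]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_classified.json.bz2")
    # Same options as the export: bz2 compression, one JSON record per line
    toy.to_json(path, compression="bz2", lines=True, orient="records")
    reloaded = pd.read_json(path, compression="bz2", lines=True)

# The reloaded frame should have the same rows and columns as the original
print(reloaded.shape)
print(reloaded["colloquial"].tolist())
```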