### Requirements

In [None]:
import pandas as pd
import bz2
import json
from pyspark.sql import SparkSession
import pyspark.sql.functions as sf
import requests
import findspark

In [None]:
findspark.init('/Users/tatianacogne/spark')

### Objectives M2
- That you can handle the data in its size.
- That you understand what’s in the data (formats, distributions, missing values, correlations, etc.).
- That you considered ways to enrich, filter, transform the data according to your needs.
- That you have a reasonable plan and ideas for methods you’re going to use, giving their essential mathematical details in the notebook.
- That your plan for analysis and communication is reasonable and sound, potentially discussing alternatives to your choices that you considered but dropped.

### Test with PySpark
https://spark.apache.org/docs/latest/sql-programming-guide.html

In [None]:
# Create a spark context
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Read JSON file into dataframe
df = spark.read.json('data/quotes-2020.json.bz2')

In [None]:
findspark.init() 

In [None]:
df.show()

## A. Understanding what’s in the data (formats, distributions, missing values, correlations, etc.)

### A1. Formats of the data

#### Summary Columns
- **quoteID**:      Primary key of the quotation (format: "YYYY-MM-DD-{increasing int:06d}")
- **quotation**:    Text of the longest encountered original form of the quotation
- **speaker**:      Selected most likely speaker
- **qids**:         Wikidata IDs of all aliases that match the selected speaker
- **date**:         Earliest occurrence date of any version of the quotation
- **numOccurrences**: Number of times this quotation occurs in the articles
- **probas**:       Array representing the probabilities of each speaker having uttered the quotation
- **urls**:         List of links to the original articles containing the quotation
- **phase**:        Corresponding phase of the data in which the quotation first occurred (A-E)
- **domains**:      Domain of the URL 
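
The `quoteID` format listed above can be checked mechanically. This is a minimal sketch, assuming the format is exactly a calendar date followed by a zero-padded six-digit counter; `parse_quote_id` is a hypothetical helper name, not something provided with the dataset.

```python
import re
from datetime import date

# Assumed pattern: "YYYY-MM-DD-" followed by a zero-padded 6-digit counter.
QUOTE_ID_RE = re.compile(r"(\d{4})-(\d{2})-(\d{2})-(\d{6})")

def parse_quote_id(quote_id):
    """Return (date, counter) if quote_id matches the assumed format, else None."""
    match = QUOTE_ID_RE.fullmatch(quote_id)
    if match is None:
        return None
    year, month, day, counter = (int(group) for group in match.groups())
    try:
        return date(year, month, day), counter
    except ValueError:  # impossible calendar date, e.g. month 13
        return None
```

Rows for which the helper returns `None` would be candidates for a closer look when checking the column.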

In [None]:
df.printSchema()

### A2. Distributions
- Idea: distribution of the words? Or maybe later?

### A3. Missing Values

### A4. Correlation
- Idea: use the course slides? I am not sure whether we should already start this kind of analysis

## B. Ways to enrich, filter, transform the data according to your needs.

### B1. Task
- Quotes: need to apply several functions to clean the quotes before analyzing them
    - remove stop words (and, the, ...)
    - stem and lemmatize the quotes
    - use NLTK functions to categorize the words in the sentence, for example
- Speakers: 
    - need to keep only the speakers different from "None"
        - is it reasonable to drop the "None" speakers? Are they not too large a share of the dataset?
    - need to regroup speakers like "President Donald Trump" and "Donald Trump"
    - add the occupations/jobs of the speakers, maybe in a new column (Obama: politician, lawyer, author)
- QIDS:  
    - add the link to the Wikipedia page of the speaker
    - keep only the qids of the speaker
        - need to check that the qids are consistent with the speaker (qids speaker =?= speaker)
- Date:
    - try to keep only the important information about the date (we may not need the minutes)
- Remove columns that we do not need (quoteID, phase, ...?)
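
The first quote-cleaning step in the list above (stop-word removal) could look like the following. This is only a sketch: the stop-word set is a tiny illustrative assumption (a real run would more likely use the NLTK list), and `clean_quote` is a hypothetical helper name.

```python
import re

# Tiny illustrative stop-word list (assumption); NLTK ships a much fuller one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it"}

def clean_quote(quotation):
    """Lowercase, tokenize on letters/apostrophes, and drop stop words."""
    tokens = re.findall(r"[a-z']+", quotation.lower())
    return [token for token in tokens if token not in STOP_WORDS]

cleaned = clean_quote("The economy is the strongest in the history of the country.")
# cleaned == ['economy', 'strongest', 'history', 'country']
```

Stemming or lemmatization would then be applied to the remaining tokens.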

## Analysis 

- check that the qids are the same for each quote
- check the probas against the speaker
- check each column
- verify the URL

**Analysing Selected Speaker vs Highest Probability Speaker**

We compare the speaker in the "speaker" column against the one with the highest probability in "probas", output the rows where the two differ, count the occurrences per speaker, and display the most frequent ones.
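
On a single record, the comparison can be sketched in plain Python as follows. It assumes that `probas` is a list of `[name, probability]` pairs with the probability stored as a string; the record below is made up for illustration, and `top_proba_speaker` is a hypothetical helper name.

```python
def top_proba_speaker(probas):
    """Return the name with the highest probability from [[name, prob_as_string], ...]."""
    if not probas:
        return None
    name, _ = max(probas, key=lambda pair: float(pair[1]))
    return name

# Made-up record shaped like one line of the dataset (assumption).
record = {
    "speaker": "None",
    "probas": [["None", "0.52"], ["Donald Trump", "0.41"], ["Mike Pence", "0.07"]],
}
mismatch = record["speaker"] != top_proba_speaker(record["probas"])
```

Taking the maximum explicitly avoids relying on the pairs being sorted by decreasing probability.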

In [None]:
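# Assumes probas is sorted by decreasing probability, so its first entry is the top candidate
temp = df.select(df.speaker, df.probas)
temp = temp.withColumn("highest_prob", temp.probas[0])        # first [name, probability] pair
temp = temp.withColumn("prob_speaker", temp.highest_prob[0])  # name of the top candidate

# Keep the DataFrame itself: .show() only prints and returns None
error_speakers = temp.filter(temp.speaker != temp.prob_speaker)
error_speakers.show()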

In [None]:
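# Count the mismatching rows per highest-probability speaker; count() already names its output column "count"
WrongSpeakers = temp.filter(temp.speaker != temp.prob_speaker).groupBy("prob_speaker").count()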

In [None]:
asc_wrong = WrongSpeakers.sort("count", ascending = False)

In [None]:
asc_wrong.toPandas().to_csv('speakers_count_19.csv')

**Analysing Columns**

Checking each column separately for aberrant values in the dataset

In [None]:
# In Spark, comparing a column with == None never matches any row; use isNull() instead
df.filter(df.date.isNull()).show()
df.filter(df.numOccurrences.isNull()).show()
df.filter(df.phase.isNull()).show()
df.filter(df.probas.isNull()).show()
df.filter(df.qids.isNull()).show()
df.filter(df.quotation.isNull()).show()
df.filter(df.quoteID.isNull()).show()
df.filter(df.speaker.isNull()).show()
df.filter(df.urls.isNull()).show()
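
The same check can be summarized as one count per column. Below is a plain-Python sketch on a few made-up records; it assumes that a value counts as missing when it is absent, `None`, an empty string, an empty list, or the string "None" (the marker used for unknown speakers). `count_missing` is a hypothetical helper name.

```python
def count_missing(records, columns):
    """Count, per column, the records whose value is considered missing."""
    missing = {column: 0 for column in columns}
    for record in records:
        for column in columns:
            value = record.get(column)
            # Assumed notion of "missing": None, "", the string "None", or an empty list
            if value in (None, "", "None") or value == []:
                missing[column] += 1
    return missing

# Made-up sample rows (assumption), shaped like the dataset.
sample = [
    {"speaker": "Donald Trump", "qids": ["Q22686"], "quotation": "We will see."},
    {"speaker": "None", "qids": [], "quotation": "It was a great day."},
]
missing_counts = count_missing(sample, ["speaker", "qids", "quotation"])
```

Dividing each count by the number of records gives the share of missing values, which helps decide whether dropping the "None" speakers is acceptable.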

In [None]:
# speaker
df_speakers =df.drop_duplicates(subset=['speaker'])

In [None]:
num_diff_speakers = df_speakers.count()

In [None]:
df_names = df_speakers[['speaker']]

In [None]:
df_none = df[df.speaker=='None']

# Note: .str.contains is the pandas API and fails on a Spark column
# (Can't extract value from speaker#14: need struct type but got string); use Column.contains instead
df.filter(df.speaker.contains('pokemon')).show()

In [None]:
# Take a small pandas sample of the Spark DataFrame and filter it with the pandas API
df_test = df.limit(100).toPandas()
df_test[df_test.speaker.str.contains('google.com', regex=False)]

### Cell to extract only the speakers ('at' mode appends, so the datasets of several years end up in the same file)

In [120]:
path_to_file = 'data/quotes-2019.json.bz2'
path_to_out = 'data/speakers_19_20.txt.bz2'
with bz2.open(path_to_file, 'rb') as s_file:
    with bz2.open(path_to_out, 'at') as d_file:  # 'at': append in text mode
        for instance in s_file:
            instance = json.loads(instance)  # parse one JSON line
            speaker = instance['speaker']  # keep only the speaker's name
            d_file.write(speaker + '\n')  # append it to the output file
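
If only the distinct names are needed, they could also be collected while streaming, without writing every occurrence to disk. This is a sketch under that assumption; `unique_speakers` is a hypothetical helper, and the demo file contains made-up rows.

```python
import bz2
import json
import os
import tempfile

def unique_speakers(path):
    """Stream a bz2-compressed JSON-lines file and collect the distinct speaker names."""
    seen = set()
    with bz2.open(path, "rt", encoding="utf-8") as source:
        for line in source:
            seen.add(json.loads(line)["speaker"])
    return seen

# Self-contained demo on a tiny temporary file (made-up rows).
rows = [{"speaker": "Donald Trump"}, {"speaker": "None"}, {"speaker": "Donald Trump"}]
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.json.bz2")
    with bz2.open(demo_path, "wt", encoding="utf-8") as out:
        for row in rows:
            out.write(json.dumps(row) + "\n")
    demo_speakers = unique_speakers(demo_path)
```

The set stays small compared with the full file, since it holds each name only once.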

**Analysing Number of Occurrences**

Looking at the most frequently occurring quotes

In [None]:
# Keep the sorted DataFrame itself: .show() only prints and returns None
ordered_occurrences = df.sort("numOccurrences", ascending=False)
ordered_occurrences.show()

#### Analysis on speakers 
- Number of different speakers : 218415

In [122]:
data = pd.read_csv('data/speakers_19_20.txt.bz2', sep="\n", header=None)
data = data[data[0] !='None'].drop_duplicates()

### Special cases in the speaker's name

In [124]:
df_trump = data.loc[data[0].str.contains('Trump',case = False)]
df_trump = df_trump.drop_duplicates()
searchfor = ['Melania', 'Eric','Ivanka','judd','trumpauer','Barron','Lara','Andreas','william','trumper','spencer','blaine','charles','ivana']
df_trump = df_trump[~df_trump[0].str.contains('|'.join(searchfor),case = False)]
df_trump

Unnamed: 0,0
247,Donald Trump
263,President Donald Trump
609,President Trump
13688,Donald Trump Jr. .
15578,Donald Trump Jr
16328,"Donald Trump , Jr. ."
16448,Donald J. Trump
27659,President Donald J. Trump
31087,Donald trump
38768,president Donald Trump


- There are more than 35 different names for Donald Trump, in lower and upper cases
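
Part of this variation could be removed with a simple normalization before grouping. The sketch below lowercases the name, drops punctuation, and strips leading titles; the title list is an assumption to be extended, and `normalize_speaker` is a hypothetical helper name.

```python
import re

# Assumed list of leading titles to strip; extend as needed.
TITLES = {"president", "senator", "minister", "professor", "duchess"}

def normalize_speaker(name):
    """Lowercase, keep letter-only tokens, and drop leading titles."""
    tokens = re.findall(r"[^\W\d_]+", name.lower())  # letters only, Unicode-aware
    while tokens and tokens[0] in TITLES:
        tokens.pop(0)
    return " ".join(tokens)

variants = ["Donald Trump", "President Donald Trump", "Donald trump", "president Donald Trump"]
normalized = {normalize_speaker(variant) for variant in variants}
```

This does not merge "Donald J. Trump" or "President Trump" with the others, and it keeps "Donald Trump Jr" separate, so the qids may turn out to be a more reliable grouping key.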

In [129]:
data['len'] = data[0].apply(lambda x : len(x.split()))
data[data['len'] >= 9]

Unnamed: 0,0,len
6766815,"Cristóbal Colón de Carvajal y Gorosábel , 18th...",11
12115980,"David Albert Charles Armstrong-Jones , 2nd Ear...",9
14789104,"Julian Asquith , 2nd Earl of Oxford and Asquith",9
15931964,eating disorders working group of the Psychiat...,9
17341178,"Christopher Walter Monckton , 3rd Viscount Mon...",9
23036249,Eating Disorders Working Group of the Psychiat...,9


In [133]:
df_duchess = data.loc[data[0].str.contains('Duchess',case = False)].drop_duplicates()
df_duchess.head(5)

Unnamed: 0,0,len
3017,"Meghan , the Duchess of Sussex",6
4166,"Camilla , Duchess of Cornwall",5
7622,Duchess of Sussex,3
9489,"Meghan , Duchess of Sussex",5
18938,Duchess Meghan,2


- The same problem occurs with titles such as Duchess, Minister, and Professor

In [134]:
df_director = data[data[0].str.contains('director',case = False)].drop_duplicates()
df_director.head(5)

Unnamed: 0,0,len
7259,theater director,2
290334,Director X,2
875877,Theater Director,2
1278777,Theater director,2
3334797,director x,2


- The same problem occurs with the word "director", although on reflection these entries are not really names

In [136]:
df_special_cases = data.loc[data[0].str.contains("''",case = False)].drop_duplicates()
df_special_cases

Unnamed: 0,0,len
16887,Philip `` Brave '' Davis,5
18367,Nicole `` Snooki '' Polizzi,5
28771,Steve `` Lips '' Kudlow,5
30179,Kent `` Smallzy '' Small,5
50412,Jake `` The Snake '' Roberts,6
...,...,...
25099511,Walter `` Wali '' Jones,5
25123980,Leticia `` Tish '' Cyrus,5
25204213,Vincent `` Rocco '' Vargas,5
25857515,Nancy `` Rusty '' Barceló,5


- Some speaker names contain quotation marks (written `` and ''), here around a nickname
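
One way to handle these names is to separate the nickname from the rest. This is a minimal sketch, assuming the nickname is always wrapped as `` nickname ''; `split_nickname` is a hypothetical helper name.

```python
import re

# Matches a nickname wrapped in the quote style seen above: `` nickname ''
NICKNAME_RE = re.compile(r"``\s*(.+?)\s*''")

def split_nickname(name):
    """Return (name_without_nickname, nickname), with nickname None if absent."""
    match = NICKNAME_RE.search(name)
    if match is None:
        return name, None
    stripped = NICKNAME_RE.sub(" ", name)
    return " ".join(stripped.split()), match.group(1)
```

The name without the nickname could then be matched against the other spellings of the same speaker.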

In [137]:
data[data[0] == "Hey That 's No Way to Say Goodbye"]

Unnamed: 0,0,len
1945236,Hey That 's No Way to Say Goodbye,8


- Some speakers have strange names (this one looks like a title rather than a person's name)