### Requirements

In [1]:
!pip install findspark
!pip install pyspark



In [2]:
import pandas as pd
import bz2
import json
from pyspark.sql import SparkSession
import pyspark.sql.functions as sf
import requests
import findspark

In [3]:
findspark.init('/Users/tatianacogne/spark')

### Objectives M2
- That you can handle the data in its size.
- That you understand what’s in the data (formats, distributions, missing values, correlations, etc.).
- That you considered ways to enrich, filter, transform the data according to your needs.
- That you have a reasonable plan and ideas for methods you’re going to use, giving their essential mathematical details in the notebook.
- That your plan for analysis and communication is reasonable and sound, potentially discussing alternatives to your choices that you considered but dropped.

### Test with PySpark
https://spark.apache.org/docs/latest/sql-programming-guide.html

In [4]:
# Create a spark context
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Read JSON file into dataframe
df = spark.read.json('data/quotes-2020.json.bz2')

In [5]:
findspark.init() 

In [6]:
df.show()

+-------------------+--------------+-----+--------------------+--------------------+--------------------+-----------------+-------------------+--------------------+
|               date|numOccurrences|phase|              probas|                qids|           quotation|          quoteID|            speaker|                urls|
+-------------------+--------------+-----+--------------------+--------------------+--------------------+-----------------+-------------------+--------------------+
|2020-01-28 08:04:05|             1|    E|[[None, 0.7272], ...|                  []|[ D ] espite the ...|2020-01-28-000082|               None|[http://israelnat...|
|2020-01-16 12:00:13|             1|    E|[[Sue Myrick, 0.8...|           [Q367796]|[ Department of H...|2020-01-16-000088|         Sue Myrick|[http://thehill.c...|
|2020-02-10 23:45:54|             1|    E|[[None, 0.8926], ...|                  []|... He (Madhav) a...|2020-02-10-000142|               None|[https://indianex...|
|2020-02-1

## A. Understanding of what’s in the data (formats, distributions, missing values, correlations, etc.)

### A1. Formats of the data

#### Summary Columns
- **quoteID**:      Primary key of the quotation (format: "YYYY-MM-DD-{increasing int:06d}")
- **quotation**:    Text of the longest encountered original form of the quotation
- **speaker**:      Selected most likely speaker
- **qids**:         Wikidata IDs of all aliases that match the selected speaker
- **date**:         Earliest occurrence date of any version of the quotation
- **numOccurrences**: Number of times this quotation occurs in the articles
- **probas**:       Array representing the probabilities of each speaker having uttered the quotation
- **urls**:         List of links to the original articles containing the quotation
- **phase**:        Corresponding phase of the data in which the quotation first occurred (A-E)
- **domains**:      Domain of the URL 
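
The `quoteID` format described above can be checked with a small helper. This is a minimal sketch using only the standard library; the function name `parse_quote_id` and the regular expression are assumptions derived from the documented format, not part of the dataset tooling.

```python
import re
from datetime import date

# Pattern assumed from the documented format "YYYY-MM-DD-{increasing int:06d}".
QUOTE_ID_PATTERN = re.compile(r"(\d{4})-(\d{2})-(\d{2})-(\d{6})")


def parse_quote_id(quote_id):
    """Return (date, counter) for a well-formed quoteID, or None otherwise."""
    match = QUOTE_ID_PATTERN.fullmatch(quote_id)
    if match is None:
        return None
    year, month, day, counter = (int(part) for part in match.groups())
    try:
        return date(year, month, day), counter
    except ValueError:
        # The digits matched the pattern but do not form a real calendar date.
        return None


# quoteID taken from the df.show() output above.
parsed = parse_quote_id("2020-01-16-000088")
```

A helper like this could later be applied to the `quoteID` column to flag malformed keys before the column is dropped.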

In [7]:
df.printSchema()

root
 |-- date: string (nullable = true)
 |-- numOccurrences: long (nullable = true)
 |-- phase: string (nullable = true)
 |-- probas: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: string (containsNull = true)
 |-- qids: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- quotation: string (nullable = true)
 |-- quoteID: string (nullable = true)
 |-- speaker: string (nullable = true)
 |-- urls: array (nullable = true)
 |    |-- element: string (containsNull = true)
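
The schema shows that each `probas` entry is an array of strings, so the probability has to be cast to a number before it can be compared. Below is a minimal standard-library sketch of that conversion on a single JSON line; the record values and the helper name `parse_probas` are made up for illustration.

```python
import json

# Illustrative JSON line shaped like the schema above; the values are made up.
line = '{"speaker": "None", "probas": [["Alice Example", "0.3887"], ["None", "0.3512"]]}'


def parse_probas(raw_probas):
    """Convert [[name, "0.12"], ...] into [(name, 0.12), ...]."""
    return [(name, float(prob)) for name, prob in raw_probas]


record = json.loads(line)
probas = parse_probas(record["probas"])

# Pick the candidate with the highest probability explicitly,
# rather than assuming the array is already sorted.
top_speaker, top_prob = max(probas, key=lambda pair: pair[1])
```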



### A2. Distributions
- Idea: distribution of the words? Or maybe later?

### A3. Missing Values

### A4. Correlation
- Idea: use the course slides? I am not sure whether we should already start this kind of analysis.

## B. Ways to enrich, filter, transform the data according to your needs.

### B1. Task
- Quotes: need to use different functions in order to filter the quotes before analyzing
    - remove stop words (and, the, ...)
    - stem and lemmatize the quotes
    - use NLTK functions to categorize the words in the sentence, for example
- Speakers: 
    - need to keep only the speakers different from "None"
        - is it reasonable to drop the None speakers? Are they not too big a percentage of the dataset?
    - need to regroup speakers like "President Donald Trump" and "Donald Trump"
    - add the occupations/jobs of the speakers, maybe in a new column (Obama: politician, lawyer, author)
- QIDS:  
    - add the link to the Wikipedia page of the speaker
    - keep only the qids of the speaker
        - need to check that the qids are consistent with the speaker (qids speaker =?= speaker)
- Date:
    - try to keep only the important information about the date (maybe we don't need to keep the minutes)
- Remove columns that we do not need (quoteID, phase, ...?)
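
Some of the steps listed above (dropping "None" speakers, merging titled names, truncating the date, removing unused columns) can be sketched on a single record. This is only an illustration in plain Python: the title list `TITLES`, the set of dropped columns and the helper names are hypothetical, and the same logic would still need to be ported to Spark to run on the full dataset.

```python
from datetime import datetime

# Hypothetical list of titles to strip; extend it after inspecting the data.
TITLES = ("President ", "Senator ", "Dr. ")
# Columns the plan marks as candidates for removal.
DROPPED_COLUMNS = ("quoteID", "phase")


def normalize_speaker(name):
    """Strip a leading title so 'President Donald Trump' matches 'Donald Trump'."""
    for title in TITLES:
        if name.startswith(title):
            return name[len(title):]
    return name


def clean_record(record):
    """Return a cleaned copy of one quote record, or None if the speaker is 'None'."""
    if record["speaker"] == "None":
        return None
    cleaned = {key: value for key, value in record.items() if key not in DROPPED_COLUMNS}
    cleaned["speaker"] = normalize_speaker(record["speaker"])
    # Keep only the day; the source format is 'YYYY-MM-DD HH:MM:SS'.
    parsed = datetime.strptime(record["date"], "%Y-%m-%d %H:%M:%S")
    cleaned["date"] = parsed.date().isoformat()
    return cleaned


# Record built from a row of the df.show() output above (other columns omitted).
example = {
    "quoteID": "2020-01-16-000088",
    "speaker": "Sue Myrick",
    "date": "2020-01-16 12:00:13",
    "phase": "E",
}
cleaned_example = clean_record(example)
```

Stripping a fixed list of titles is the simplest option; matching speakers through their `qids` may be more robust, since aliases of the same person should share a Wikidata ID.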

## Analysis 

- check whether the qids are the same for each quote
- check the probabilities against the speaker
- check each column
- verify the URL

**Analysing Selected Speaker vs Highest Probability Speaker**

We compare the speaker in the "speaker" column against the one with the highest probability in "probas", output the rows where the two differ, count the number of occurrences per speaker, and display the most frequent ones.

In [8]:
temp = df.select(df.speaker, df.probas)
temp = temp.withColumn("highest_prob", temp.probas[0])
temp = temp.withColumn("prob_speaker", temp.highest_prob[0])

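# show() prints the rows and returns None, so keep the filtered DataFrame in its own variable
error_speakers = temp.filter(temp.speaker != temp.prob_speaker)
error_speakers.show()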

+-----------------+--------------------+--------------------+--------------------+
|          speaker|              probas|        highest_prob|        prob_speaker|
+-----------------+--------------------+--------------------+--------------------+
|             None|[[Kris Bryant, 0....|[Kris Bryant, 0.4...|         Kris Bryant|
|         Jane Roe|[[None, 0.2695], ...|      [None, 0.2695]|                None|
| Christian Doidge|[[None, 0.1002], ...|      [None, 0.1002]|                None|
|     Heidi Larson|[[None, 0.0614], ...|      [None, 0.0614]|                None|
|             None|[[Rio Ferdinand, ...|[Rio Ferdinand, 0...|       Rio Ferdinand|
|             None|[[Paul Brown, 0.3...|[Paul Brown, 0.3887]|          Paul Brown|
|     Joel Dommett|[[None, 0.0367], ...|      [None, 0.0367]|                None|
|        Ed Turner|[[None, 0.0172], ...|      [None, 0.0172]|                None|
|      Peter Weber|[[None, 0.3477], ...|      [None, 0.3477]|                None|
|   

In [9]:
# Count the mismatched rows per top-probability speaker; count() already names its column "count"
WrongSpeakers = temp.filter(temp.speaker != temp.prob_speaker).groupBy("prob_speaker").count()

In [10]:
asc_wrong = WrongSpeakers.sort("count", ascending = False)

In [11]:
asc_wrong.toPandas().to_csv('speakers_count_19.csv')

**Analysing Columns**

Checking for aberrant values in the dataset, each column separately

In [12]:
# In Spark, `column == None` evaluates to NULL for every row, so the filter never matches.
# Test for missing values with isNull() instead, one column at a time.
for column in df.columns:
    df.filter(sf.col(column).isNull()).show()

+----+--------------+-----+------+----+---------+-------+-------+----+
|date|numOccurrences|phase|probas|qids|quotation|quoteID|speaker|urls|
+----+--------------+-----+------+----+---------+-------+-------+----+
+----+--------------+-----+------+----+---------+-------+-------+----+

+----+--------------+-----+------+----+---------+-------+-------+----+
|date|numOccurrences|phase|probas|qids|quotation|quoteID|speaker|urls|
+----+--------------+-----+------+----+---------+-------+-------+----+
+----+--------------+-----+------+----+---------+-------+-------+----+

+----+--------------+-----+------+----+---------+-------+-------+----+
|date|numOccurrences|phase|probas|qids|quotation|quoteID|speaker|urls|
+----+--------------+-----+------+----+---------+-------+-------+----+
+----+--------------+-----+------+----+---------+-------+-------+----+

+----+--------------+-----+------+----+---------+-------+-------+----+
|date|numOccurrences|phase|probas|qids|quotation|quoteID|speaker|urls|
+--

In [13]:
# Keep one row per distinct speaker
df_speakers = df.drop_duplicates(subset=['speaker'])

In [14]:
num_diff_speakers = df_speakers.count()

In [15]:
df_names = df_speakers[['speaker']]

In [16]:
df_none = df.filter(df.speaker == 'None')

# `.str.contains` is pandas syntax; on a Spark column it fails with
# "Can't extract value from speaker#14: need struct type but got string".
# Column.contains is the Spark equivalent.
df.filter(df.speaker.contains('pokemon'))

**Analysing Number of Occurrences**

Looking at the most frequently occurring quotes

In [None]:
# show() returns None, so sort first and display separately
ordered_occurrences = df.sort("numOccurrences", ascending=False)
ordered_occurrences.show()

#### Analysis on speakers 
- Number of distinct speakers: 218415