### Requirements

In [1]:
!pip install findspark
!pip install pyspark



In [2]:
!pip install requests


In [1]:
import pandas as pd
import bz2
import json
from pyspark.sql import SparkSession
import pyspark.sql.functions as sf
import requests
import findspark

In [None]:
findspark.init('/Users/tatianacogne/spark')

### Objectives M2
- That you can handle the data in its size.
- That you understand what’s in the data (formats, distributions, missing values, correlations, etc.).
- That you considered ways to enrich, filter, transform the data according to your needs.
- That you have a reasonable plan and ideas for methods you’re going to use, giving their essential mathematical details in the notebook.
- That your plan for analysis and communication is reasonable and sound, potentially discussing alternatives to your choices that you considered but dropped.

### Test with PySpark
https://spark.apache.org/docs/latest/sql-programming-guide.html

In [2]:
# Create a spark context
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Read JSON file into dataframe
df = spark.read.json('data/quotes-2020.json.bz2')

In [None]:
findspark.init() 

### Summary Columns
- **quoteID**:      Primary key of the quotation (format: "YYYY-MM-DD-{increasing int:06d}")
- **quotation**:    Text of the longest encountered original form of the quotation
- **speaker**:      Selected most likely speaker
- **qids**:         Wikidata IDs of all aliases that match the selected speaker
- **date**:         Earliest occurrence date of any version of the quotation
- **numOccurrences**: Number of times this quotation occurs in the articles
- **probas**:       Array representing the probabilities of each speaker having uttered the quotation
- **urls**:         List of links to the original articles containing the quotation
- **phase**:        Corresponding phase of the data in which the quotation first occurred (A-E)
- **domains**:      Domain of the URL 
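
The `quoteID` format listed above can be checked with a small parser. This is a minimal sketch: `parse_quote_id` and `QUOTE_ID_RE` are hypothetical names, and allowing more than six counter digits is an assumption based on how `06d` zero-padding behaves.

```python
import re
from datetime import date

# "YYYY-MM-DD-{increasing int:06d}": 06d pads to at least six digits,
# so longer counters are accepted as well (assumption).
QUOTE_ID_RE = re.compile(r"(\d{4})-(\d{2})-(\d{2})-(\d{6,})")

def parse_quote_id(quote_id):
    """Return (date, counter) for a well-formed quoteID, or None otherwise."""
    match = QUOTE_ID_RE.fullmatch(quote_id)
    if match is None:
        return None
    year, month, day, counter = (int(group) for group in match.groups())
    try:
        return date(year, month, day), counter
    except ValueError:  # impossible calendar date, e.g. month 13
        return None

print(parse_quote_id("2020-01-28-000082"))
print(parse_quote_id("2020-13-01-000001"))
```

A helper like this could be wrapped in a Spark UDF to flag malformed keys, at the cost of running Python code on every row.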

In [6]:
df.printSchema()

root
 |-- date: string (nullable = true)
 |-- numOccurrences: long (nullable = true)
 |-- phase: string (nullable = true)
 |-- probas: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: string (containsNull = true)
 |-- qids: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- quotation: string (nullable = true)
 |-- quoteID: string (nullable = true)
 |-- speaker: string (nullable = true)
 |-- urls: array (nullable = true)
 |    |-- element: string (containsNull = true)



In [7]:
df.show()

+-------------------+--------------+-----+--------------------+--------------------+--------------------+-----------------+-------------------+--------------------+
|               date|numOccurrences|phase|              probas|                qids|           quotation|          quoteID|            speaker|                urls|
+-------------------+--------------+-----+--------------------+--------------------+--------------------+-----------------+-------------------+--------------------+
|2020-01-28 08:04:05|             1|    E|[[None, 0.7272], ...|                  []|[ D ] espite the ...|2020-01-28-000082|               None|[http://israelnat...|
|2020-01-16 12:00:13|             1|    E|[[Sue Myrick, 0.8...|           [Q367796]|[ Department of H...|2020-01-16-000088|         Sue Myrick|[http://thehill.c...|
|2020-02-10 23:45:54|             1|    E|[[None, 0.8926], ...|                  []|... He (Madhav) a...|2020-02-10-000142|               None|[https://indianex...|
|2020-02-1

# Tasks

## Analysis 

- Check that the qids are the same for each quote
- Check the probabilities (probas) against the selected speaker
- Check each column
- Verify the URLs

**Analysing Selected Speaker vs Highest Probability Speaker**

We compare the speaker in the "speaker" column with the highest-probability speaker in "probas", output the rows where the two differ, count the occurrences of each mismatching speaker, and display the most frequent ones.

In [41]:
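temp = df.select(df.speaker, df.probas)
# Take the first [name, probability] pair of probas as the most likely speaker
temp = temp.withColumn("highest_prob", temp.probas[0])
temp = temp.withColumn("prob_speaker", temp.highest_prob[0])

# Keep the rows where the selected speaker differs from that speaker
error_speakers = temp.filter(temp.speaker != temp.prob_speaker)
error_speakers.show()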

+-----------------+--------------------+--------------------+------------------+
|          speaker|              probas|        highest_prob|      prob_speaker|
+-----------------+--------------------+--------------------+------------------+
|             None|[[Gemma Collins, ...|[Gemma Collins, 0...|     Gemma Collins|
|             None|[[Craig Pittman, ...|[Craig Pittman, 0...|     Craig Pittman|
|             None|[[Emma Mitchell, ...|[Emma Mitchell, 0...|     Emma Mitchell|
|             None|[[Moses Elisaf, 0...|[Moses Elisaf, 0....|      Moses Elisaf|
|      Amber Heard|[[None, 0.1885], ...|      [None, 0.1885]|              None|
|             None|[[Ian McIntosh, 0...|[Ian McIntosh, 0....|      Ian McIntosh|
|   Naomi Campbell|[[None, 0.307], [...|       [None, 0.307]|              None|
|      Tom DeLonge|[[None, 0.0275], ...|      [None, 0.0275]|              None|
|             None|[[Jair Bolsonaro,...|[Jair Bolsonaro, ...|    Jair Bolsonaro|
|     Donald Trump|[[None, 0
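
The cell above treats `probas[0]` as the highest-probability entry. Rows such as `[None, 0.1885]` in the output suggest that the first entry may not always carry the largest probability, so it may be worth taking the maximum explicitly. The sketch below shows that logic in plain Python. `top_speaker` is a hypothetical helper and the rows are made up for illustration; the probabilities are parsed from strings because the schema stores them that way.

```python
def top_speaker(probas):
    """Return the name with the largest probability in a probas array.

    Each entry is a [name, probability] pair of strings, as in the schema.
    """
    if not probas:
        return None
    name, _ = max(probas, key=lambda pair: float(pair[1]))
    return name

# Hypothetical rows for illustration (not taken from the dataset).
rows = [
    {"speaker": "Speaker A", "probas": [["None", "0.2"], ["Speaker A", "0.8"]]},
    {"speaker": "None", "probas": [["None", "0.6"], ["Speaker B", "0.4"]]},
    {"speaker": "None", "probas": [["Speaker C", "0.7"], ["None", "0.3"]]},
]

# Rows where the selected speaker is not the most probable one.
mismatches = [row for row in rows if row["speaker"] != top_speaker(row["probas"])]
print(mismatches)
```

Only the third row is reported here. The first row would be flagged by a `probas[0]` comparison even though the selected speaker does have the largest probability.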

In [42]:
# Count the mismatching rows per highest-probability speaker
WrongSpeakers = temp.filter(temp.speaker != temp.prob_speaker).groupBy("prob_speaker").count()

In [47]:
asc_wrong = WrongSpeakers.sort("count", ascending = False)

In [48]:
asc_wrong.toPandas().to_csv('speakers_count_19.csv')

**Analysing Columns**

Checking for aberrant values in the dataset, one column at a time

In [49]:
# In Spark, `column == None` evaluates to NULL and never matches; use isNull() instead
df.filter(df.date.isNull()).show()
df.filter(df.numOccurrences.isNull()).show()
df.filter(df.phase.isNull()).show()
df.filter(df.probas.isNull()).show()
df.filter(df.qids.isNull()).show()
df.filter(df.quotation.isNull()).show()
df.filter(df.quoteID.isNull()).show()
df.filter(df.speaker.isNull()).show()
df.filter(df.urls.isNull()).show()

+----+--------------+-----+------+----+---------+-------+-------+----+
|date|numOccurrences|phase|probas|qids|quotation|quoteID|speaker|urls|
+----+--------------+-----+------+----+---------+-------+-------+----+
+----+--------------+-----+------+----+---------+-------+-------+----+

+----+--------------+-----+------+----+---------+-------+-------+----+
|date|numOccurrences|phase|probas|qids|quotation|quoteID|speaker|urls|
+----+--------------+-----+------+----+---------+-------+-------+----+
+----+--------------+-----+------+----+---------+-------+-------+----+

+----+--------------+-----+------+----+---------+-------+-------+----+
|date|numOccurrences|phase|probas|qids|quotation|quoteID|speaker|urls|
+----+--------------+-----+------+----+---------+-------+-------+----+
+----+--------------+-----+------+----+---------+-------+-------+----+

+----+--------------+-----+------+----+---------+-------+-------+----+
|date|numOccurrences|phase|probas|qids|quotation|quoteID|speaker|urls|
+--
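
Missing values in this dataset do not have to be real nulls: an unknown speaker is stored as the string `'None'`, and the `qids` shown for those rows are empty lists. The sketch below counts such values per column in plain Python. `is_missing` and `count_missing` are hypothetical helpers, the rule for what counts as missing is an assumption, and the records are made up for illustration.

```python
from collections import Counter

def is_missing(value):
    """Treat null, the literal string 'None', and empty strings/lists as missing."""
    return value is None or value == "None" or value == "" or value == []

def count_missing(records):
    """Count missing values per column over an iterable of dict records."""
    counts = Counter()
    for record in records:
        for column, value in record.items():
            if is_missing(value):
                counts[column] += 1
    return counts

# Hypothetical records for illustration (not taken from the dataset).
sample = [
    {"speaker": "None", "qids": [], "quotation": "some text"},
    {"speaker": "Speaker A", "qids": ["Q1"], "quotation": "other text"},
    {"speaker": "Speaker B", "qids": ["Q2"], "quotation": None},
]
print(count_missing(sample))
```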

In [None]:
# speaker: keep one row per distinct speaker
df_speakers = df.drop_duplicates(subset=['speaker'])

In [None]:
num_diff_speakers = df_speakers.count()

In [None]:
df_names = df_speakers[['speaker']]

In [39]:
# Keep the DataFrame itself: show() returns None
df_none = df[df.speaker == 'None']
df_none.show()

+-------------------+--------------+-----+--------------------+----+--------------------+-----------------+-------+--------------------+
|               date|numOccurrences|phase|              probas|qids|           quotation|          quoteID|speaker|                urls|
+-------------------+--------------+-----+--------------------+----+--------------------+-----------------+-------+--------------------+
|2020-01-28 08:04:05|             1|    E|[[None, 0.7272], ...|  []|[ D ] espite the ...|2020-01-28-000082|   None|[http://israelnat...|
|2020-02-10 23:45:54|             1|    E|[[None, 0.8926], ...|  []|... He (Madhav) a...|2020-02-10-000142|   None|[https://indianex...|
|2020-02-15 14:12:51|             2|    E|[[None, 0.581], [...|  []|... [ I ] f it ge...|2020-02-15-000053|   None|[https://patriotp...|
|2020-02-27 08:27:00|             1|    E|[[None, 0.7164], ...|  []|[ one's ] individ...|2020-02-27-000223|   None|[https://ukhumanr...|
|2020-04-15 17:30:45|             1|    E

In [38]:
# CHECK: rows with speaker == 'None' have no qids --> OK
df_none.filter(sf.size('qids') > 0).show()

+----+--------------+-----+------+----+---------+-------+-------+----+
|date|numOccurrences|phase|probas|qids|quotation|quoteID|speaker|urls|
+----+--------------+-----+------+----+---------+-------+-------+----+
+----+--------------+-----+------+----+---------+-------+-------+----+



**Analysing Number of Occurrences**

Looking at the most frequently occurring quotes

In [None]:
ordered_occurrences = df.sort("numOccurrences", ascending=False)
ordered_occurrences.show()

In [None]:
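def valid_urls(urls):
    """Return True if at least one URL answers with a successful status."""
    for u in urls:
        try:
            if requests.get(u, timeout=10).ok:
                return True
        except requests.RequestException:
            continue  # unreachable or malformed URL: try the next one
    return False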

In [None]:
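# CHECK: valid URL (wrapped in a UDF; slow, since it sends HTTP requests for every row)
valid_urls_udf = sf.udf(valid_urls, "boolean")
df_valid_urls = df.filter(valid_urls_udf("urls"))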

In [None]:
a = sf.split(df['speaker'], ' ')
# df.filter(df.speaker.contains('Trump'))

In [None]:
url = 'https://stackoverflow.com/questions/54087303/python-requests-how-to-check-for-200-ok'
requests.get(url).ok
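
Sending one HTTP request per URL is expensive, so a cheap offline check can discard malformed URLs first and also recover the domain of each URL. This is a minimal sketch using only the standard library. `url_domain` is a hypothetical helper, and accepting only `http` and `https` schemes is an assumption.

```python
from urllib.parse import urlparse

def url_domain(url):
    """Return the lowercased host of an http(s) URL, or None if it is malformed."""
    try:
        parsed = urlparse(url)
    except ValueError:  # raised for some invalid inputs, e.g. bad IPv6 brackets
        return None
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return None
    return parsed.hostname

urls = [
    "https://stackoverflow.com/questions/54087303/python-requests-how-to-check-for-200-ok",
    "not a url",
]
print([url_domain(u) for u in urls])
```

This only checks that a URL is well formed. Whether the page still exists still requires a request such as the one above.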

#### Analysis on speakers 
- Number of different speakers : 218415

# Draft 

#### Test with DataFrame (DON'T RUN THIS CELL)

In [None]:
%%time
#df_quotes_2020 = pd.read_json('data/quotes-2020.json.bz2', compression='bz2',lines=True)
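
Loading the whole file into a pandas DataFrame is what makes the cell above impractical. A lighter option is to stream the compressed file line by line with `bz2` and `json`, which the notebook already imports. This is a sketch that assumes one JSON object per line, as implied by `lines=True`. `iter_quotes` and `count_quotes` are hypothetical helper names.

```python
import bz2
import json

def iter_quotes(path):
    """Yield one quotation record (a dict) per line of a bz2-compressed JSON-lines file."""
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)

def count_quotes(path):
    """Count the records without holding more than one of them in memory."""
    return sum(1 for _ in iter_quotes(path))
```

For example, `count_quotes('data/quotes-2020.json.bz2')` would walk the file once. Memory use stays flat, but decompression still takes time on a file of this size.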