# Exploring the sources of the data

Now that the sports quotes have been extracted, we can investigate their sources. In this notebook we explore the year 2020, which contains the least data. If this year provides a reasonable amount of data, we expect all the other years to do so as well.

The data are read in chunks anyway, so the code can be applied to bigger datasets without any change. This is an important design choice, although it makes the code a bit more complicated.
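To illustrate what chunked processing involves, here is a minimal sketch that accumulates label counts one chunk at a time. It uses `collections.Counter` rather than the per-chunk DataFrame merge used later in this notebook, and the function name `count_by_chunk` and the toy chunks are hypothetical.

```python
from collections import Counter
from typing import Iterable, List


def count_by_chunk(chunks: Iterable[List[str]]) -> Counter:
    """Accumulate label counts over an iterable of chunks."""
    totals = Counter()
    for chunk in chunks:
        # Add this chunk's counts to the running totals
        totals.update(chunk)
    return totals


# Hypothetical toy chunks standing in for the labels found in each CSV chunk
toy_chunks = [["nbcsports", "msn", "nbcsports"], ["msn", "eurosport"]]
totals = count_by_chunk(toy_chunks)
print(totals.most_common())
```

Only one chunk and the running totals are held in memory at any time, which is why the same pattern carries over to larger files.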

This first section sets up the environment and loads everything needed for the rest of the analysis. Some packages are downgraded because this is necessary to make the code run on Google Colab, which was used here.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!pip install tld
!pip install pandas==1.0.5

Collecting tld
  Downloading tld-0.12.6-py37-none-any.whl (412 kB)

In [None]:
# Imports you may need
import seaborn as sns
from IPython.display import display, HTML
import matplotlib.pyplot as plt
import scipy.stats as stats
import pandas as pd
import numpy as np
from tld import get_tld
from ast import literal_eval

# Sources of the news

Let's first have a look at the different sources of the news. To do so, we go through every URL of every quotation and extract its domain using the tld package.

In [None]:
incre = 0 # Number of chunks read so far
nlines = 0 # Number of lines (unique quotes) in the complete file
nurls = 0 # Total number of URLs over all quotes
for chunk in pd.read_csv('/content/drive/MyDrive/ada-sports-quotes/sport-quotes-2020.csv.bz2', compression='bz2', converters={'urls': literal_eval}, chunksize=100000):
  incre += 1
  sources = []
  for urls in chunk.urls:
    nlines += 1
    for url in urls:
      nurls += 1
      res = get_tld(url, as_object=True)
      sources.append(res.domain) # Recover all the sources in the chunk
  dfsources = pd.DataFrame() # Dataframe to store the sources and group them
  dfsources['label'] = sources
  dfsources = dfsources.groupby(dfsources.label).size().reset_index(name='counts')
  print(incre)
  if incre == 1:
    ranking = dfsources # ranking will be the final dataframe containing the count for each unique source
  else:
    ranking = ranking.merge(dfsources, on='label', how='outer').fillna(0) # Merge ranking with dfsources
    ranking['counts'] = ranking.counts_x + ranking.counts_y
    ranking = ranking.drop(columns=['counts_x','counts_y'])
ranking = ranking.sort_values('counts',ascending=False)
print('{} media out of {} quoting of {} unique quotes'.format(ranking.shape[0],nurls,nlines))

1
2
3
4
5
6
7
4676 media out of 3338950 quoting of 641614 unique quotes


From these results alone, two observations can be made:


*   The number of unique sources (4676) is very small compared to the number of unique quotes (641k).
*   Many quotes are reported several times, as the number of quote occurrences, i.e. URLs (3.3M), is much larger than the number of unique quotes (641k).
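The two observations above can be made concrete with a little arithmetic on the printed totals. The sketch below uses only the three numbers reported in the output; the variable names are chosen for illustration.

```python
# Totals printed by the previous cell
n_sources = 4676      # unique media domains
n_urls = 3338950      # URLs, i.e. quote occurrences
n_quotes = 641614     # unique quotes

urls_per_quote = n_urls / n_quotes    # average number of URLs per unique quote
urls_per_source = n_urls / n_sources  # average number of URLs per source

print(f"{urls_per_quote:.2f} URLs per unique quote on average")
print(f"{urls_per_source:.2f} URLs per source on average")
```

On average each unique quote appears in roughly five URLs, and each source accounts for roughly 714 URLs.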

Now let's explore the contents of "ranking":



In [None]:
ranking.describe()

| | counts |
|---|---|
| count | 4676.0 |
| mean | 714.061163 |
| std | 2639.812904 |
| min | 1.0 |
| 25% | 5.0 |
| 50% | 26.0 |
| 75% | 179.25 |
| max | 72913.0 |


The output of "describe" already shows that the distribution is heavy-tailed: the standard deviation is much larger than the mean, which is itself much larger than the median. Therefore, only a few media outlets publish a large number of sports quotes (possibly the same quote several times).

In [None]:
ranking.head(10)

| | label | counts |
|---|---|---|
| 2016 | nbcsports | 72913.0 |
| 2059 | news965 | 57164.0 |
| 1925 | msn | 45772.0 |
| 3467 | wokv | 40948.0 |
| 1598 | krmg | 34335.0 |
| 3510 | wsbradio | 29831.0 |
| 945 | eurosport | 27672.0 |
| 2613 | skysports | 23418.0 |
| 2080 | newsok | 21538.0 |
| 2753 | stv | 21232.0 |
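
To get a sense of how concentrated the sources are, the sketch below computes the share of all URLs that comes from the ten sources listed above. The counts are copied from the table and the total from the earlier output; the variable names are chosen for illustration.

```python
# Counts of the ten most frequent sources, copied from the table above
top10 = {
    "nbcsports": 72913, "news965": 57164, "msn": 45772, "wokv": 40948,
    "krmg": 34335, "wsbradio": 29831, "eurosport": 27672,
    "skysports": 23418, "newsok": 21538, "stv": 21232,
}
n_urls = 3338950  # total number of URLs reported earlier

top10_total = sum(top10.values())
share = top10_total / n_urls
print(f"Top 10 sources: {top10_total} URLs, {share:.1%} of all URLs")
```

Ten sources out of 4676 account for about 11% of all URLs, which is consistent with the heavy tail seen in the summary statistics.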
