In [7]:
import pandas as pd 
%matplotlib inline

This notebook gives an overview of the project and describes the important files, notebooks, and scripts to consider when exploring the code.

# WORKFLOW

The data is available from the following repository: https://datorium.gesis.org/xmlui/handle/10.7802/1515. There are two data sources to consider:
* edge-list 
* politician metadata

The **edge-list** directory contains CSV files. Each file holds an edge list of node IDs, recording which nodes are connected by an edge and in which direction. The first column (`from`) is the node the edge starts from, and the second column (`to`) is the node it points to.

In [4]:
pd.read_csv("data/edge-list/2001_06.csv").head()

Unnamed: 0,from,to
0,1143861,81474
1,204249,786492
2,204249,168402
3,421871,168402
4,421871,438519


The metadata file contains information on politicians found in DBpedia. However, not all of their IDs appear in the edge-list files. For example, the metadata file contains a politician with ID xxxx that cannot be found in any of the edge-list files. Therefore, the metadata file was filtered to the IDs that do appear, and the result was saved as the pickle **politician_data_inside**.
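
The filtering described above could look roughly like the sketch below. It assumes the edge lists have `from` and `to` columns and the metadata has an `ID` column, as in the outputs shown in this notebook. The helper name `filter_metadata` and the toy data are hypothetical and are not taken from the project's code.

```python
import pandas as pd

def filter_metadata(metadata, edge_lists):
    """Keep only metadata rows whose ID occurs in at least one edge list."""
    ids_in_edges = set()
    for edges in edge_lists:
        ids_in_edges.update(edges["from"])
        ids_in_edges.update(edges["to"])
    return metadata[metadata["ID"].isin(ids_in_edges)]

# Toy data standing in for the real files (values are made up).
metadata = pd.DataFrame({"ID": [11, 144, 183], "gender": ["male", "male", "male"]})
edge_lists = [
    pd.DataFrame({"from": [11, 183], "to": [183, 999]}),
    pd.DataFrame({"from": [183], "to": [11]}),
]

inside = filter_metadata(metadata, edge_lists)
print(inside["ID"].tolist())
```

In this toy example ID 144 never appears in an edge list, so it is dropped. The filtered frame could then be written out with `DataFrame.to_pickle`.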

In [5]:
pd.read_pickle("data/politician_data_inside").head()

Unnamed: 0,#DBpURL,ID,WikiURL,birthDate,deathDate,first_name,fpp_gender,full_name,gender,genderize,name,nationality,occupation,party
0,http://dbpedia.org/resource/Quincy_Timberlake,11,http://en.wikipedia.org/wiki/Quincy_Timberlake,1980-04-22 00:00:00,,quincy,,Quincy+Timberlake,male,,"[ quincy timberlake , timberlake quincy ]",[kenyan],[politician],[]
5,http://dbpedia.org/resource/Nizamettin_Erkmen,144,http://en.wikipedia.org/wiki/Nizamettin_Erkmen,,1990-10-24 00:00:00,nizamettin,,Nizamettin+Erkmen,male,,[ erkmen nizamettin ],[turkish],[politician],[]
6,http://dbpedia.org/resource/Claudio_Scajola,183,http://en.wikipedia.org/wiki/Claudio_Scajola,1948-01-15 00:00:00,,claudio,,Claudio+Scajola,male,,[ claudio scajola ],[italian],[politician],[ forza italia (2013) ]
8,http://dbpedia.org/resource/Thomas_Clausen_(Lo...,246,http://en.wikipedia.org/wiki/Thomas_Clausen_(L...,1939-12-22 00:00:00,2002-02-20 00:00:00,thomas,,Thomas+Clausen+(Louisiana),male,,"[ thomas greenwood clausen , clausen thomas g...",[american],[politician],[ democratic party (united states) ]
9,http://dbpedia.org/resource/Yang_Ti-liang,248,http://en.wikipedia.org/wiki/Yang_Ti-liang,1929-06-30 00:00:00,,yang,,Yang+Ti-liang,male,,[ ti-liang yang ],[],[politician],[]


# Steps for preparing the data

1. As we do not need a network for every month, we first need to filter the edge lists according to preference. This is done with the script:
```bash
python filter_edgelist_data.py [edge_list_dir_path] [path_save] [interval]
```
Each file contains the edge list for one month; for example, 2016_5.csv is May 2016. The files can be thinned to an **interval** of 4m (every 4 months), 6m (every 6 months), or 12m (every 12 months). The script reads all files in the specified folder (**edge_list_dir_path**), filters them according to the specified **interval**, and stores the result in the directory **path_save**.
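
As a rough illustration of interval filtering, the sketch below selects file names by month. It is not the code in `filter_edgelist_data.py`: the function `keep_file` is hypothetical, and it assumes that intervals are counted from January, so `4m` keeps January, May, and September.

```python
import re

def keep_file(filename, interval):
    """Return True if a 'YYYY_M.csv' file falls on the requested interval."""
    match = re.fullmatch(r"(\d{4})_(\d{1,2})\.csv", filename)
    if match is None:
        return False
    month = int(match.group(2))
    step = int(interval.rstrip("m"))  # "4m" -> 4
    # Assumption: intervals are counted from January (1, 1 + step, ...).
    return (month - 1) % step == 0

files = ["2016_1.csv", "2016_4.csv", "2016_5.csv", "2016_9.csv", "notes.txt"]
selected = [f for f in files if keep_file(f, "4m")]
print(selected)
```

The actual script may anchor the interval differently (for example on the first available month), so the modulo rule here should be checked against it.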


2. Now that the edge-list files are filtered, we can build directed graphs: 

```bash
python save_network.py [path_files] [path_save] [graph_type]
```

The script iterates through the specified edge-list files (**path_files**), loads them as networkx graphs, and assigns attributes to all nodes of each graph. Finally, it saves each graph as a pickle in the specified location (**path_save**). The graph type (**graph_type**) can be 'dir' for directed or 'undir' for undirected. In addition, a graph size statistic (number of nodes and edges) is generated at the end.

The attributes for each politician are stored at this path: **data/politicians_with_gender**
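
The graph-building step can be sketched as follows. This is a minimal illustration under assumptions, not the contents of `save_network.py`: the function `build_graph`, the attribute dictionary layout, and the attribute values are hypothetical.

```python
import pickle
import networkx as nx

def build_graph(edges, attributes, graph_type="dir"):
    """Build a graph from (source, target) pairs and attach node attributes."""
    graph = nx.DiGraph() if graph_type == "dir" else nx.Graph()
    graph.add_edges_from(edges)
    # Attach metadata only for nodes that are present in this graph.
    for node, attrs in attributes.items():
        if node in graph:
            graph.nodes[node].update(attrs)
    return graph

edges = [(1143861, 81474), (204249, 786492), (204249, 168402)]
attributes = {204249: {"gender": "female"}, 81474: {"gender": "male"}}  # made-up values

graph = build_graph(edges, attributes, graph_type="dir")
print(graph.number_of_nodes(), graph.number_of_edges())

# Round-trip through pickle, mirroring the save step described above.
restored = pickle.loads(pickle.dumps(graph))
```

Nodes without an entry in the attribute dictionary are kept in the graph with no attributes, which matches the situation where an ID has no metadata.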


3. If specific graphs need to be selected, there is a script:
```bash
python filter_graphs.py [path_copy] [path_paste] [filter.csv]
```
This script copies files from **path_copy** to **path_paste**. The files to be copied are specified in the **filter.csv** file.
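
A copy step of this kind might look like the sketch below. The layout of **filter.csv** is not documented here, so the sketch assumes one file name per row with no header; `copy_listed_files` is a hypothetical helper, and the demonstration runs in a temporary directory with made-up file names.

```python
import csv
import shutil
import tempfile
from pathlib import Path

def copy_listed_files(path_copy, path_paste, filter_csv):
    """Copy every file named in the first column of filter_csv."""
    path_paste = Path(path_paste)
    path_paste.mkdir(parents=True, exist_ok=True)
    copied = []
    with open(filter_csv, newline="") as handle:
        for row in csv.reader(handle):
            if not row:
                continue
            source = Path(path_copy) / row[0]
            if source.is_file():
                shutil.copy(source, path_paste / source.name)
                copied.append(source.name)
    return copied

# Demonstration in a temporary directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    src, dst = Path(tmp) / "graphs", Path(tmp) / "selected"
    src.mkdir()
    for name in ["2015_1", "2016_1", "2017_1"]:
        (src / name).write_text("placeholder")
    filter_file = Path(tmp) / "filter.csv"
    filter_file.write_text("2015_1\n2017_1\n")
    copied = copy_listed_files(src, dst, filter_file)
    remaining = sorted(p.name for p in dst.iterdir())
print(copied)
```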


4. To calculate the network properties (in-degree, out-degree, eigenvector centrality, k-core) for each node and save them as node attributes, we use the notebook **add_stats_to_network.ipynb**.
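
For orientation, the four properties can be computed with standard networkx functions as in the sketch below. The notebook may do this differently: the helper `add_node_stats` and the attribute names (`in_degree`, `out_degree`, `eigenvector`, `k_core`) are assumptions chosen for the example.

```python
import networkx as nx

def add_node_stats(graph):
    """Compute per-node statistics and store them as node attributes."""
    stats = {
        "in_degree": dict(graph.in_degree()),
        "out_degree": dict(graph.out_degree()),
        "eigenvector": nx.eigenvector_centrality(graph, max_iter=1000),
        "k_core": nx.core_number(graph),  # requires a graph without self-loops
    }
    for name, values in stats.items():
        nx.set_node_attributes(graph, values, name)
    return graph

# Small directed toy graph (not project data).
graph = nx.DiGraph([(1, 2), (2, 3), (3, 1), (1, 3)])
add_node_stats(graph)
print(graph.nodes[3])
```

For directed graphs, `nx.core_number` uses in-degree plus out-degree, and `nx.eigenvector_centrality` is based on incoming edges. On large or poorly connected graphs the eigenvector iteration may need more iterations or fail to converge.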


5. The **efficiency** node attribute is calculated with a Python 2 implementation based on graph-tool. This involves a few steps and pieces of code.

The script **calculate_efficiency.py** uses **graph_tool** which has Python 2 support only. The methods from this script are used in the notebook: **add_efficiency.ipynb**

6. The year a page was added to Wikipedia can be derived from the edge-list files, because they show when an ID first appears. Using this, we build cohorts of IDs that entered Wikipedia in the same year. The cohorts are created and stored using the notebook below:

    **cohorts.ipynb**
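
The cohort idea can be sketched in a few lines of plain Python. This is an illustration only: `build_cohorts` is a hypothetical helper, and the input is assumed to be a mapping from year to a list of `(from, to)` pairs rather than the project's actual files.

```python
def build_cohorts(edges_by_year):
    """Map each node ID to the first year it appears, then group IDs by year."""
    first_year = {}
    for year in sorted(edges_by_year):
        for source, target in edges_by_year[year]:
            for node in (source, target):
                # setdefault keeps the earliest year because years are visited in order.
                first_year.setdefault(node, year)
    cohorts = {}
    for node, year in first_year.items():
        cohorts.setdefault(year, set()).add(node)
    return cohorts

# Toy input, deliberately given out of order.
edges_by_year = {
    2002: [(1, 2), (3, 1)],
    2001: [(1, 2)],
    2003: [(4, 3)],
}
cohorts = build_cohorts(edges_by_year)
print(cohorts)
```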
    
7. The metadata file stores everything (dates, lists, etc.) as strings. These values are converted back to their proper data types using:

    **initial_clean.ipynb**
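
A conversion of this kind might look like the sketch below. The raw string formats are not shown in this notebook, so the sketch assumes dates of the form `YYYY-MM-DD HH:MM:SS` and lists stored as Python list literals such as `"['kenyan']"`. The helpers `parse_date` and `parse_list` are hypothetical.

```python
import ast
from datetime import datetime

def parse_date(value):
    """Convert 'YYYY-MM-DD HH:MM:SS' strings to datetime; empty values become None."""
    if not value:
        return None
    return datetime.strptime(value, "%Y-%m-%d %H:%M:%S")

def parse_list(value):
    """Convert a string such as "['kenyan']" back into a Python list."""
    if not value:
        return []
    return list(ast.literal_eval(value))

# Made-up record in the assumed raw format.
record = {"birthDate": "1980-04-22 00:00:00", "deathDate": "", "nationality": "['kenyan']"}
cleaned = {
    "birthDate": parse_date(record["birthDate"]),
    "deathDate": parse_date(record["deathDate"]),
    "nationality": parse_list(record["nationality"]),
}
print(cleaned)
```

If the stored lists are not valid Python literals, `ast.literal_eval` raises an error and a custom parser would be needed instead.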
    
8. Finally, all the additional information is combined with this notebook:

    **connecting_data.ipynb**

# Exploratory analysis

Different explorations can be found in:
* /exploration/
* /pageviews/pageviews_exploration.ipynb
* /model/explore.ipynb
* exploration/dataset_check.ipynb
* arcticle_length.ipynb

# Kolmogorov-Smirnov Test 

We use this test to compare the distributions of network properties (in-degree, out-degree, k-core, eigenvector centrality), using the code in the notebook below:
*  **d_test.ipynb**
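
To make the test statistic concrete, the sketch below computes the two-sample Kolmogorov-Smirnov statistic D, the largest vertical gap between the two empirical distribution functions. It is a plain-Python illustration, not the notebook's code; the function `ks_statistic` and the sample values are made up. SciPy's `scipy.stats.ks_2samp` computes the same statistic along with a p-value.

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic D: the largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    # The empirical CDFs only change at observed values, so checking those is enough.
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d

# Made-up in-degree samples for two groups of nodes.
in_degree_group_1 = [0, 1, 1, 2, 5]
in_degree_group_2 = [3, 4, 6, 8, 9]
print(ks_statistic(in_degree_group_1, in_degree_group_2))
```

A value of D near 0 means the two distributions are nearly identical, and a value near 1 means they barely overlap.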

# Pageview prediction and Gender Classification

This is done in the following notebooks:
* model/views_prediction.ipynb
* model/gender_clf.ipynb

The prepared data is divided into general data and per-country data. For all the data, use:
* **data/final_sets/model** or **data/final_sets/model_large**

To train the model on specific countries, use (the example is for American politicians in 2016):
* **data/final_sets/countires/2016_american**

# Network Size plots

To produce plots like the one below, use the following:
* network_size.ipynb 

<img src="plots/network_size/Ratio (Global, US, FRA, GER, RUS, GB) and Network size.png">

# Other

An exploration of unidirectional links can be found in the notebook:
* **exploration/unidirectional_links.ipynb**

Gender detection is done using the software from **https://github.com/gesiscss/image-gender-inference**

Downloading Pageview data is done with the software from: **https://github.com/gesiscss/wiki-download-parse-page-views**

# FILES

In [None]:
NOTEBOOKS