# Data Fetching and Cleaning

## Introduction

In this notebook we will fetch the dataset and clean the necessary parts.

### Information about the Dataset

The dataset is based on the AMiner V12 database, which can be downloaded from https://www.aminer.org/citation

The Predictors (Inputs): 
The Target (Output): 'n_citation' (number of citations)


### Adding the libraries

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import ast
import numpy as np
import seaborn as sns
from statistics import mean
import pickle

### Adding the Database

In [None]:
data_papers = pd.read_pickle('../Data/papers0-2010.pkl')
data_papers.head()

Let's analyze each column:

In [None]:
print("The shape of the data is: ", str(data_papers.shape))
data_papers.dtypes

## Data Analysis and Cleaning

Most of the data consists of integers and strings, which pandas reports as the object dtype. We also know that n_citation should be of type int64, but due to some data inconsistencies it shows up as object. We will clean the data first.

In [None]:
# first check for null values
print(data_papers['n_citation'].isnull().sum())

# drop the rows where n_citation or year is null
data_papers = data_papers.dropna(subset = ['n_citation','year'])

# check the data again for null values
data_papers['n_citation'].isnull().sum()

One of the rows had the value 'Journal' instead of an actual n_citation value. We will drop that row.

In [None]:
data_papers = data_papers.drop(data_papers.loc[data_papers['n_citation'] == 'Journal'].index)
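
A stray value like this can also be located programmatically rather than by eye. The following is a minimal sketch on a small made-up frame; `toy` and its values are assumptions for illustration, not rows from the AMiner data. It relies on `pd.to_numeric` with `errors='coerce'`, which turns anything that cannot be parsed into NaN.

```python
import pandas as pd

# Hypothetical toy data standing in for data_papers
toy = pd.DataFrame({'n_citation': [3, '12', 'Journal', 0, '7']})

# Coerce to numbers; entries that cannot be parsed become NaN
parsed = pd.to_numeric(toy['n_citation'], errors='coerce')

# Rows that hold a value but not a numeric one
bad_rows = toy[parsed.isna() & toy['n_citation'].notna()]
print(bad_rows)

# Drop the offending rows and convert the rest to integers
clean = toy.drop(bad_rows.index)
clean['n_citation'] = pd.to_numeric(clean['n_citation']).astype('int64')
print(clean['n_citation'].tolist())
```

Listing the unparseable rows first makes it possible to inspect them before deciding whether dropping is the right choice.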

Some rows store the citation value as a string while others store it as an int. We convert all values to integers for consistency.

In [None]:
data_papers['n_citation'] = pd.to_numeric(data_papers['n_citation'])
data_papers['n_citation']

In [None]:
data_papers.dtypes

Now we see that n_citation is of type int64. Let's look at basic statistics such as count, percentiles, mean, std, max and min for the number of citations.

In [None]:
data_papers.info()

In [None]:
data_papers['n_citation'].describe()

Looking at the data, the mean number of citations is about 13 while the median is 3, which shows that the data is fairly right-skewed. Half of the papers have 3 or fewer citations, and 75% have 9 or fewer.

- We suspect the max value of 42080 may be an outlier
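
The reasoning above (mean well above median, a suspicious maximum) can be sketched on a toy series. The numbers below are invented for illustration, and the 1.5 * IQR fence (Tukey's rule) is one common convention for flagging candidates, not a method taken from this notebook.

```python
import pandas as pd

# Hypothetical citation counts, chosen only to show a right-skewed sample
toy = pd.Series([0, 1, 1, 2, 3, 3, 4, 9, 15, 500])

mean_val, median_val = toy.mean(), toy.median()
q1, q3 = toy.quantile(0.25), toy.quantile(0.75)
iqr = q3 - q1

# Tukey's rule: values above Q3 + 1.5 * IQR are flagged as potential outliers
upper_fence = q3 + 1.5 * iqr
flagged = toy[toy > upper_fence]

print(f"mean={mean_val:.1f}, median={median_val:.1f}, upper fence={upper_fence:.1f}")
print(flagged.tolist())
```

For heavy-tailed data such as citation counts, a rule like this tends to flag many genuine highly cited papers, so flagged values are candidates to inspect rather than rows to delete automatically.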


Let's look at the largest values in the dataset and how many times each occurs.

In [None]:
# count how many papers share each citation value, then keep the 30 largest values
top = data_papers['n_citation'].value_counts().sort_index(ascending=False).head(30)
top_dict = list(zip(top.index, top.values))
top_dict

Plot the citation values against the number of times they occur.

In [None]:
top_df = pd.DataFrame(top_dict, columns =['n_citation', 'Frequency']) 
sns.lmplot( x='n_citation', y='Frequency', data = top_df)

Let's also look at the distribution of the number of citations with a box plot.

In [None]:
sns.boxplot(x=data_papers['n_citation'])

Here we can see that most of the citation counts are close to 3-10, with a long skewed tail between 10 and 15000. There is also a group of points at about 40000 citations.
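
With a tail this long, the box itself is squeezed into a thin line near zero. One common way to make such a distribution readable is a log transform. The sketch below uses invented values and is not a step from the original analysis; it only shows that `np.log1p` reduces the skew of a toy series.

```python
import numpy as np
import pandas as pd

# Hypothetical citation counts with a long right tail
toy = pd.Series([0, 1, 2, 3, 5, 8, 10, 120, 15000, 40000])

# log1p(x) = log(1 + x), so papers with zero citations stay defined
log_citations = np.log1p(toy)

# Sample skewness before and after the transform
raw_skew = toy.skew()
log_skew = log_citations.skew()
print(raw_skew, log_skew)
```

The transformed series could then be passed to `sns.boxplot` in place of the raw column.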

### Analyzing the other columns

#### Analyze the column 'year'

In [None]:
data_papers['year'].head(3)

The years are stored as float64. We convert them to int64.

In [None]:
column_name = 'year'

# first check for null value in the column
print(data_papers[data_papers[column_name].isnull()])

# change the data type into int64
data_papers[column_name] = data_papers[column_name].astype(np.int64)

# check for unique values
data_papers[column_name].unique()
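
The cast above only works because the rows with a missing year were dropped earlier: `astype(np.int64)` raises on NaN. A minimal sketch of a defensive version follows, using a made-up `years` series as an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical year column with one missing entry
years = pd.Series([1995.0, 2003.0, 2010.0, np.nan])

# Missing years must be removed before casting to int64
years = years.dropna()

# Confirm every value is a whole number so the cast loses no information
all_whole = bool((years % 1 == 0).all())

years = years.astype(np.int64)
print(all_whole, years.tolist())
```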

#### Alias_ids
After checking alias_ids, we concluded that it is not a consistent attribute in the database: it appears in only 480 instances in an 80k sample.

#### Fos
fos has 385 NaN values; we decided to drop those rows from the dataset.

#### Venue
venue has 784 NaN values; we decided to drop those rows from the dataset as well.

#### Drop the NaN rows in fos and venue

In [None]:
print(data_papers['fos'].isnull().sum())
data_papers = data_papers.dropna(subset = ['fos'])

In [None]:
print(data_papers['venue'].isnull().sum())
data_papers = data_papers.dropna(subset = ['venue'])
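
The per-column checks above can be combined into one pass. This sketch uses a small invented frame (`toy` and its values are assumptions) to count NaN values in every column at once and to report how many rows a combined `dropna` removes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing fos and one missing venue
toy = pd.DataFrame({
    'title': ['a', 'b', 'c', 'd'],
    'fos': ['ML', np.nan, 'DB', 'NLP'],
    'venue': ['X', 'Y', np.nan, 'Z'],
})

# Number of NaN values in each column
null_counts = toy.isnull().sum()
print(null_counts)

# Drop rows that are missing either attribute
cleaned = toy.dropna(subset=['fos', 'venue'])
print(f"dropped {len(toy) - len(cleaned)} of {len(toy)} rows")
```

Reporting the number of dropped rows makes it easy to see how much data each cleaning decision costs.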

Save the data as a pickle for the next section.

In [None]:
pickle_path = '../Data/papers0-2010_clean.pkl'
data_papers.to_pickle(pickle_path)
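
Before moving on, it can be worth confirming that the saved file loads back unchanged. The following is a sketch on a toy frame written to a temporary directory; the file name `toy_clean.pkl` is hypothetical and not part of the project.

```python
import os
import tempfile

import pandas as pd

# Hypothetical cleaned data
toy = pd.DataFrame({'n_citation': [3, 12, 0], 'year': [1995, 2003, 2010]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_clean.pkl')  # hypothetical file name
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

# equals() checks that shape, values and column dtypes all match
round_trip_ok = restored.equals(toy)
print(round_trip_ok)
```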