# Basic data processing

In this notebook, the data obtained via the data_fetch.ipynb notebook is used to compute the TF-IDF table, which is then saved to data/tfidf.csv. Although this notebook runs significantly faster than data_fetch.ipynb, the resulting CSV is also available online: https://upm365-my.sharepoint.com/:x:/g/personal/alejandro_alvarezco_alumnos_upm_es/EdrSyZi8X0BAvI0tgkCgB_8BaBiYKFfx-7LM4ruRNw2rHQ?e=STNQy1

In [1]:
import os
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
TXT_PATH = "data/stemmed/"

files = [TXT_PATH + i for i in os.listdir(TXT_PATH)]

# A float min_df is a proportion of documents, so 2/len(files) keeps only the terms that appear in at least 2 files (not in 2% of them)
min_df = 2 / len(files)

vectoriz = TfidfVectorizer(input='filename',token_pattern=r'(?u)\b[A-Za-z][A-Za-z]+\b', decode_error='ignore', stop_words='english', min_df=min_df)
matrix = vectoriz.fit_transform(files)

tf_idf = pd.DataFrame(matrix.toarray(), columns=vectoriz.get_feature_names_out())
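
For intuition, the weighting performed by the cell above can be sketched in plain Python on a toy corpus. This is only a sketch, not scikit-learn's implementation: it assumes the library's default settings (lowercasing, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2-normalised rows) and skips stop-word removal. The `tfidf` helper, its `min_docs` parameter, and the three sample sentences are made up for illustration.

```python
import math
import re
from collections import Counter

# Same token pattern as in the cell above: words of two or more ASCII letters.
TOKEN = re.compile(r"(?u)\b[A-Za-z][A-Za-z]+\b")


def tfidf(docs, min_docs=2):
    """Toy TF-IDF with smoothed IDF and L2-normalised rows (hypothetical helper)."""
    counts = [Counter(TOKEN.findall(d.lower())) for d in docs]
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for c in counts for term in c)
    # Keep only the terms found in at least `min_docs` documents.
    vocab = sorted(t for t, k in df.items() if k >= min_docs)
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for c in counts:
        row = [c[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in row)) or 1.0  # avoid dividing by zero
        rows.append([x / norm for x in row])
    return vocab, rows


# Made-up sample documents, not taken from the dataset.
docs = ["graph neural network", "neural network pruning", "graph cut x1"]
vocab, rows = tfidf(docs)
print(vocab)
```

Here `pruning` and `cut` are discarded because they occur in a single document, and `x1` never matches the token pattern, so the vocabulary comes out as `graph`, `network`, `neural`.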

In [3]:
tf_idf.head()

Unnamed: 0,aa,aaa,aaab,aaai,aabb,aachen,aacn,aad,aaditya,aae,...,zy,zyes,zygmunt,zym,zyx,zyz,zz,zzt,zzu,zzz
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.005645,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


After this step, some unwanted words are clearly present in the vocabulary. These may come from URLs that were truncated while extracting the text from the PDF files, from the reference sections of most papers, etc. We therefore perform an additional manual cleanup step in which we remove every column whose first two letters are equal. We also remove every column whose sum is 0; in theory no such column should exist, but the check guarantees it.
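
To see what the first rule does in isolation, it can be applied to a handful of the column names shown in the output above. This is a minimal sketch in plain Python; the `has_repeated_prefix` helper is a hypothetical name introduced only for this illustration.

```python
def has_repeated_prefix(term):
    """Return True when the first two letters of the term are equal."""
    return len(term) >= 2 and term[0] == term[1]


# A few column names taken from the table above.
sample = ["aa", "aaai", "aachen", "ab", "abadi", "zy", "zz", "zzz"]

dropped = [t for t in sample if has_repeated_prefix(t)]
kept = [t for t in sample if not has_repeated_prefix(t)]

print("dropped:", dropped)
print("kept:", kept)
```

The rule is a heuristic: besides fragments such as `aa` or `zzz`, it also drops a legitimate term like `aachen`, which seems an acceptable trade-off for a cheap filter.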

Another cleanup step to consider would be removing every word that does not belong to an English dictionary. While this would indeed get rid of the unwanted words, it could also remove technical terms that such a dictionary does not include, losing valuable information. For this reason, this step is omitted.
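
The trade-off can be illustrated with a small sketch of what such a filter might look like. It is not used in this notebook: the `split_by_dictionary` helper and the tiny `dictionary` set are hypothetical stand-ins for a real word list.

```python
def split_by_dictionary(columns, dictionary):
    """Split column names into those found in the dictionary and the rest."""
    kept = [c for c in columns if c in dictionary]
    dropped = [c for c in columns if c not in dictionary]
    return kept, dropped


# Column names taken from the tables in this notebook.
columns = ["abandon", "abat", "zyx", "zygmunt"]
# Hypothetical, deliberately tiny word list; a real one would be far larger.
dictionary = {"abandon", "network"}

kept, dropped = split_by_dictionary(columns, dictionary)
print("kept:", kept)
print("dropped:", dropped)
```

With this word list the noise term `zyx` is removed, but so are `abat` and `zygmunt`. Since the input files are read from `data/stemmed/`, many columns are presumably stems rather than full words, which would make a plain dictionary lookup even more lossy.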

In [4]:
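# Collect the unwanted columns first, then drop them in a single call
# (faster than dropping one column at a time inside the loop).
to_drop = [
    col for col in tf_idf.columns
    if col[0] == col[1]            # first two letters are equal (e.g. "aa", "zz")
    or tf_idf[col].sum() == 0.0    # column contains only zeros
]
tf_idf = tf_idf.drop(columns=to_drop)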

In [5]:
tf_idf.head()

Unnamed: 0,ab,aba,abad,abadi,abandon,abarghouei,abat,abati,abavisani,abb,...,zxc,zxi,zxl,zxq,zy,zyes,zygmunt,zym,zyx,zyz
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.003689,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [6]:
tf_idf.to_csv("data/tfidf.csv", index=False)