Now let's begin by organizing (AKA cleaning and pre-processing) the titles (headers) of our articles.

# 1. We import the libraries

In [None]:
import pandas as pd
import re

# 2. We get the data

In [None]:
data = pd.read_csv("raw_data.csv")

In [None]:
data

# 3. We split the title to get the CS identifier

We will match the data (titles and articles) with the metadata through the CS identifier, which appears in both dataframes. So we first need to extract that identifier from the article titles here.
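
To make the goal concrete, here is a minimal sketch of the kind of match this identifier enables later on. Both small tables, the `Year` column, and the identifier values are made up for illustration; the real metadata columns may differ.

```python
import pandas as pd

# Hypothetical article table: "ID" holds the CS identifier taken from the title.
articles = pd.DataFrame({
    "Title": ["First article", "Second article"],
    "ID": ["CS101", "CS102"],
})

# Hypothetical metadata table; the column names are assumptions.
metadata = pd.DataFrame({
    "ID": ["CS102", "CS101"],
    "Year": [2020, 2019],
})

# Match rows on the shared identifier, regardless of row order.
merged = articles.merge(metadata, on="ID", how="left")
print(merged)
```

A left merge keeps every article even when no metadata row shares its identifier, which makes missing matches easy to spot.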

In [None]:
title = data["Title"].to_list()

In [None]:
title

First we split each title on "CS". (An alternative would be to use a regex, but that is more complicated.)

In [None]:
result = [s.split('CS') for s in title]

In [None]:
result
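
For comparison, this is roughly what the regex alternative could look like. It is only a sketch: the sample titles are invented, and the pattern assumes that the identifier is "CS" followed by digits and sits right before a final ".txt", which may not hold for every title.

```python
import re

# Hypothetical titles; the real ones come from data["Title"].
sample_titles = ["Some headline CS12345.txt", "Another headline CS678.txt"]

# Assumed pattern: "CS" followed by digits, right before a final ".txt".
pattern = re.compile(r"^(.*?)(CS\d+)\.txt$")

def split_title(raw):
    """Return (title, identifier), or (raw, None) if the pattern does not match."""
    match = pattern.match(raw)
    if match is None:
        return raw, None
    return match.group(1).strip(), match.group(2)

pairs = [split_title(t) for t in sample_titles]
print(pairs)
```

Unlike the plain split, this version strips the whitespace around the title and returns `None` as the identifier for titles that do not fit the pattern, instead of failing later.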

Splitting removes the "CS" separator itself, so now we need to add it back to the identifier. Otherwise it would not match the identifiers in the metadata later on.

In [None]:
modified_data = [[inner[0], 'CS' + inner[1]] for inner in result]

In [None]:
modified_data
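
The small sketch below shows on one made-up title why "CS" has to be added back, and how `str.rpartition` could keep the separator instead. Which of the two suits the real titles depends on whether "CS" can also appear inside the headline itself.

```python
raw = "Some headline CS12345.txt"  # hypothetical title

# split() discards the separator, which is why "CS" has to be added back.
parts = raw.split("CS")
print(parts)

# rpartition() splits once, at the last "CS", and returns the separator too.
head, sep, tail = raw.rpartition("CS")
print([head, sep + tail])
```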

And now we need to remove the trailing .txt so that the identifier matches the one in the metadata dataframe.

In [None]:
cleaned_data = [[item[0], item[1].replace('.txt', '')] for item in modified_data]

In [None]:
cleaned_data

In [None]:
len(cleaned_data)
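
Beyond checking the length, a couple of quick sanity checks on the identifiers can catch problems before any matching. The sketch below runs on a made-up stand-in for `cleaned_data`; the checks themselves are suggestions, not part of the original workflow.

```python
from collections import Counter

# Hypothetical stand-in for cleaned_data: [title, identifier] pairs.
sample_cleaned = [
    ["Some headline ", "CS12345"],
    ["Another headline ", "CS678"],
    ["Repeated headline ", "CS678"],
]

ids = [pair[1] for pair in sample_cleaned]

# Identifiers that appear more than once would produce duplicate rows in a later match.
duplicates = [i for i, n in Counter(ids).items() if n > 1]
print(duplicates)

# Identifiers that still carry the file extension were not cleaned properly.
leftovers = [i for i in ids if i.endswith(".txt")]
print(leftovers)
```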

# 4. And now we create a new dataframe with a new column for the article ID

First we break that list into two separate lists: one for the titles and one for the identifiers.

In [None]:
title = [i[0] for i in cleaned_data]

In [None]:
len(title)

In [None]:
id_articles = [i[1] for i in cleaned_data]

In [None]:
len(id_articles)

And now we create the new dataframe.

In [None]:
final_data = pd.DataFrame(title, columns = ["Title"])

In [None]:
final_data

In [None]:
final_data["ID"] = id_articles

In [None]:
final_data

And now we add the article text from the original dataframe.

In [None]:
text = data["Content"].to_list()

In [None]:
text

In [None]:
final_data["Article"] = text

In [None]:
final_data
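
As a side note, the same three-column table can also be built in a single call from the three lists. The sketch below uses short made-up lists in place of `title`, `id_articles` and `text`.

```python
import pandas as pd

# Hypothetical stand-ins for the title, id_articles and text lists.
sample_titles = ["Some headline ", "Another headline "]
sample_ids = ["CS12345", "CS678"]
sample_texts = ["Text of the first article.", "Text of the second article."]

# Building from a dict sets all three columns at once, in the given order.
compact = pd.DataFrame({
    "Title": sample_titles,
    "ID": sample_ids,
    "Article": sample_texts,
})
print(compact)
```

One advantage of this form is that pandas raises a `ValueError` if the lists have different lengths, so a mismatch between titles, identifiers and texts cannot go unnoticed.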

So now we have our clean dataset!

# 5. We export everything into a csv file

In [None]:
final_data.to_csv("headers.csv")
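
One detail worth knowing about the export: by default `to_csv` also writes the row index as an extra, unnamed first column, which shows up as `Unnamed: 0` when the file is read back. The sketch below demonstrates this on a one-row made-up dataframe, writing to a string instead of a file; passing `index=False` is an option if that column is not wanted.

```python
import io
import pandas as pd

# Hypothetical one-row dataframe standing in for final_data.
frame = pd.DataFrame({"Title": ["Some headline "], "ID": ["CS12345"]})

# Default: the row index is written as an extra, unnamed first column.
with_index = frame.to_csv()
# index=False writes only the named columns.
without_index = frame.to_csv(index=False)

print(with_index.splitlines()[0])
print(without_index.splitlines()[0])

# Reading the default export back shows the extra column.
reloaded = pd.read_csv(io.StringIO(with_index))
print(list(reloaded.columns))
```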