The dataset we created has a column called <code>source</code> that lists, separated by semicolons, the sources that reported on each event. Our analysis looks at each source individually, so we need a modified version of the dataset with one row per individual source per event. For example, if a row lists three sources for an event, this process expands that one row into three rows, one for each source.
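
As a small illustration of the target shape, the sketch below expands a toy DataFrame with pandas' <code>str.split</code> and <code>explode</code>. The toy data and the <code>event_id</code> column are made up for this example, and this is a different technique from the SQL join used in the rest of this notebook; it only shows what "one row per source" means.

```python
import pandas as pd

# Toy data (hypothetical, not taken from the ACLED dataset)
toy = pd.DataFrame({
    'event_id': ['E1', 'E2'],
    'source': ['Source A; Source B; Source C', 'Source A'],
})

# Split the semicolon-separated string into a list,
# then emit one row per list element
expanded = (
    toy.assign(source_singular=toy['source'].str.split(';'))
       .explode('source_singular')
       .reset_index(drop=True)
)

# Remove the whitespace left over after splitting on ';'
expanded['source_singular'] = expanded['source_singular'].str.strip()
print(expanded)
```

The event with three sources becomes three rows, and the event with one source stays a single row.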

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

We first read in our local CSV file of the dataset we pulled from the ACLED API.

In [2]:
df = pd.read_csv('../data/acled_covid19.csv')

Then, we get a list of all the distinct individual sources that are in our dataset. We turn this list into a pandas DataFrame for our next steps.

In [3]:
# First, we get a collective list of sources
sources_list = list()
for s in df['source']:
    for i in s.split(';'):
        sources_list.append(i.strip())
        
# From our collective list of sources we need a pandas DataFrame of distinct sources
sources_distinct = list(set(sources_list))
sources_distinct_df = pd.DataFrame({'source_singular': sources_distinct})
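
For reference, the same distinct list can be written more compactly with a set comprehension. This is only a sketch on made-up toy data; the <code>dropna()</code> call is an added assumption, since a missing value in <code>source</code> would have no <code>split</code> method.

```python
import pandas as pd

# Toy data (hypothetical); note the inconsistent spacing after ';'
toy = pd.DataFrame({'source': ['Source A; Source B', 'Source B;Source C', 'Source A']})

# Split every row on ';', strip whitespace, and let the set remove duplicates.
# dropna() is an assumption: it skips rows with no source at all.
toy_sources_distinct = sorted(
    {part.strip() for s in toy['source'].dropna() for part in s.split(';')}
)
toy_sources_distinct_df = pd.DataFrame({'source_singular': toy_sources_distinct})
print(toy_sources_distinct_df)
```

Without <code>strip()</code>, 'Source B' and ' Source B' would be counted as two different sources.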

Finally, we join the pandas DataFrame of distinct individual sources to our original COVID-19 dataset (read in as another pandas DataFrame) on the condition that the <code>source</code> column contains the individual source. To do this, we use the <code>pandasql</code> library.

In [4]:
# The join below uses a LIKE clause, so each distinct source needs percent-sign wildcards around it.
# We add them here in the pandas DataFrame rather than inside the pandasql query.
sources_distinct_df['source_singular'] = sources_distinct_df['source_singular'].apply(lambda x: f'%{x}%')

# Join the distinct-source DataFrame with the main DataFrame. I prefer pandasql here because it supports the LIKE clause.
from pandasql import sqldf
sql = lambda q: sqldf(q, globals())

expanded_source_df = sql(
'''
    SELECT * FROM df main
    JOIN sources_distinct_df dst_src
    ON main.source LIKE dst_src.source_singular
'''
)

# We can now remove the percent-sign wildcards from the source_singular column, since we only needed them for the join.
# Slicing off the first and last character leaves any '%' inside a source name untouched.
expanded_source_df['source_singular'] = expanded_source_df['source_singular'].str[1:-1]
expanded_source_df.head()

Unnamed: 0.1,Unnamed: 0,data_id,iso,event_id_cnty,event_id_no_cnty,event_date,year,time_precision,event_type,sub_event_type,...,latitude,longitude,geo_precision,source,source_scale,notes,fatalities,timestamp,iso3,source_singular
0,0,9498574,862,VEN12964,12964,2022-09-17,2022,1,Protests,Peaceful protest,...,8.1292,-63.5409,1,Diario Primicia,Subnational,"On 17 September 2022, in Ciudad Bolivar (Boliv...",0,1664226314,VEN,Diario Primicia
1,1,9491030,410,KOR25174,25174,2022-09-16,2022,1,Protests,Peaceful protest,...,37.4744,127.0304,1,EDaily,National,"On 16 September 2022, members of the All-Korea...",0,1663685720,KOR,EDaily
2,2,9491038,156,CHN12137,12137,2022-09-15,2022,1,Protests,Peaceful protest,...,22.2811,114.1598,1,HK01,Subnational,"On 15 September 2022, three representatives of...",0,1663685720,CHN,HK01
3,3,9491260,410,KOR25204,25204,2022-09-15,2022,1,Protests,Peaceful protest,...,37.5223,126.9075,1,YNA,National,"On 15 September 2022, members of the COVID-19 ...",0,1663685720,KOR,YNA
4,4,9492137,250,FRA18626,18626,2022-09-15,2022,1,Protests,Peaceful protest,...,43.2951,-0.3708,1,France Bleu,National,"On 15 September 2022, around 30 opponents of c...",0,1663691322,FRA,France Bleu


A few singular sources match rows they should not. The <code>LIKE</code> join is a substring match, and with SQLite (the default backend for <code>pandasql</code>) it is case-insensitive for ASCII characters, so a singular source whose name appears inside another source's name is matched to that other source as well.

Examples:
- the source ERR matches any source string with 'Osterreich' in its name
- the source Today matches Parma Today and Milano Today
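
The behaviour can be reproduced directly with Python's built-in <code>sqlite3</code> module. The helper <code>like</code> below is a hypothetical name introduced for this sketch; it assumes the query runs on SQLite with its default <code>LIKE</code> settings.

```python
import sqlite3

conn = sqlite3.connect(':memory:')

def like(text, pattern):
    """Return True if SQLite evaluates `text LIKE pattern` as true."""
    return bool(conn.execute('SELECT ? LIKE ?', (text, pattern)).fetchone()[0])

# 'Osterreich' contains the letters 'err', and LIKE ignores ASCII case
print(like('Osterreich', '%ERR%'))

# 'Today' is a substring of 'Parma Today'
print(like('Parma Today', '%Today%'))

# An unrelated pair does not match
print(like('EDaily', '%YNA%'))
```

The first two checks succeed even though neither pair is the same source, which is why an exact-match filter is needed after the join.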

To correct this, we create a column called <code>source_list</code>, and we remove rows where <code>source_singular</code> is not an element of <code>source_list</code>.

In [5]:
expanded_source_df['source_list'] = expanded_source_df['source'].str.split(';')

# True where source_singular exactly equals one of the row's (whitespace-stripped) sources
is_exact_match = expanded_source_df.apply(
    lambda x: x.source_singular in [s.strip() for s in x.source_list], axis=1
)
match_count = int(is_exact_match.sum())
mismatch_count = int((~is_exact_match).sum())
print(f'There are {match_count} correctly matched rows and {mismatch_count} incorrectly matched rows.')

There are 87096 correctly matched rows and 23088 incorrectly matched rows.


In [6]:
final_expanded_source_df = expanded_source_df[
    expanded_source_df.apply(lambda x: x.source_singular in [s.strip() for s in x.source_list], axis=1)
]
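# index=False keeps the DataFrame index from being written as another 'Unnamed: 0' column
final_expanded_source_df.to_csv('EXPANDED_acled_covid19.csv', index=False)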
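
To see the exact-membership filter in isolation, here is a minimal sketch on made-up rows that mimic the output of the join. The helper <code>is_exact_member</code> is a hypothetical name used only for this example.

```python
import pandas as pd

# Toy join output (hypothetical): 'Today' was matched to all three rows by LIKE
toy = pd.DataFrame({
    'source': ['Parma Today; ANSA', 'Today', 'Milano Today'],
    'source_singular': ['Today', 'Today', 'Today'],
})

def is_exact_member(row):
    """True if source_singular equals one of the row's individual sources."""
    members = [s.strip() for s in row['source'].split(';')]
    return row['source_singular'] in members

mask = toy.apply(is_exact_member, axis=1)
kept = toy[mask]
print(kept)
```

Only the row whose source list actually contains 'Today' as its own entry survives the filter.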

To verify that we have successfully created one row for each source, we can count the number of individual sources per row in our original dataset, sum those counts, and compare the total to the length of our new expanded dataset.

In [7]:
# Number of semicolon-separated sources listed in each original row
df['source_count'] = df['source'].str.split(';').str.len()
print(f'Total individual sources in original dataset: {sum(df.source_count)}')
print(f'Length of new dataset: {len(final_expanded_source_df)}')
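
One caveat with this check: joining on the distinct sources should give one row per distinct source per event, so the two totals would only be expected to agree if no row lists the same source twice. The sketch below, on made-up toy data, shows how the two ways of counting can differ.

```python
import pandas as pd

# Toy data (hypothetical): the second row repeats the same source
toy = pd.DataFrame({'source': ['Source A; Source B', 'Source A; Source A', 'Source C']})

# Split each row and strip whitespace from every part
parts = toy['source'].str.split(';').apply(lambda xs: [x.strip() for x in xs])

# Counts every listed source, including repeats within a row
total_mentions = int(parts.str.len().sum())

# Counts each distinct source once per row
total_distinct = int(parts.apply(lambda xs: len(set(xs))).sum())

print(total_mentions, total_distinct)
```

If the totals in the real dataset differ, rows with repeated sources are one possible explanation worth checking before suspecting the join itself.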