# Cleaning/Formatting Website Data
In this section, I perform at least five data transformation and cleansing steps on the website data.

## Load necessary libraries

In [3]:
import pandas as pd
import os
import numpy as np
import warnings
import matplotlib.pyplot as plt
warnings.filterwarnings("ignore")

import requests
from bs4 import BeautifulSoup
pd.set_option('display.max_colwidth', None)
import re

## Review the HTML page, extract the related data, and create a DataFrame
Wikipedia's year pages use URLs of the form 'https://en.wikipedia.org/wiki/<year>', so I can substitute different years to extract the major world events of each year. Since the US economy data extracted earlier from the Federal Reserve Bank of St. Louis covers 01/01/2020 to 09/01/2024, I will align with that date range in this section.

Let's try and work on the 2020 data first.

In [6]:
# Fetch the HTML content
url = "https://en.wikipedia.org/wiki/2020"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

Opening the page source with Ctrl + U shows that the 'Events' section starts on line 1156 with an h2 tag whose id is "Events", which helps me locate the related data.

In [8]:
# Look for the Events section
events_section_header = soup.find("h2", id = "Events")
if events_section_header:
    events_section = events_section_header.find_next("ul")
else:
    raise Exception("Could not find the 'Events' section on the page.")

According to the source, each month has an h3 tag with an ID, and each day's events are listed in nested \<ul\> and \<li\> tags, with multiple events possible per day. For days with only one event, the event description is part of the \<li\> tag itself instead of a nested \<ul\> with multiple \<li\> tags. We need to handle both formats: one for days with a single event and another for days with multiple events.

Another difficulty is that, for a day such as January 2, 2020, which has a single event, the event text in the source contains multiple links \(\<a\> tags\), so simply extracting the next sibling text is not sufficient to capture the entire event description. Therefore, for single events I extract the entire content of the \<li\> tag, including the text inside nested tags like \<a\>, and concatenate all the text within the tag.
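Before running this against the live page, the two layouts can be tried on a small hand-written fragment. This is only a sketch: the HTML below is made up to imitate the structure described above, and `parse_day` is a hypothetical helper, not part of the scraping code in this notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating the two list layouts described above.
html = """
<ul>
  <li><a href="/wiki/January_1">January 1</a>
    <ul>
      <li>First event.</li>
      <li>Second event.</li>
    </ul>
  </li>
  <li><a href="/wiki/January_2">January 2</a> – The <a href="/wiki/X">Air Force</a> is deployed.</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
outer = soup.find("ul")

def parse_day(li):
    """Return (date, [events]) for one top-level <li>."""
    date = li.find("a").get_text()
    nested = li.find("ul")
    if nested:
        # Several events: one nested <li> per event
        return date, [item.get_text().strip() for item in nested.find_all("li")]
    # Single event: join every text fragment, including text inside <a> tags
    text = " ".join(li.stripped_strings)
    return date, [text.replace(f"{date} – ", "", 1)]

rows = [parse_day(li) for li in outer.find_all("li", recursive=False)]
for row in rows:
    print(row)
```

Joining the fragments with a space is also why a single-event description can end up with a space before punctuation whenever a link is followed directly by a period or comma.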

In [10]:
# Extract the dates and event text

# Create a list of all months in the 'Events' section
months = ["January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"]

data = []
year = 2020 # Let's only explore the year of 2020 for the moment before we move on with other years

# Loop through each month and extract events
for month in months:
    month_header = soup.find("h3", id = month) # Find the <h3> tag for the current month

    if month_header:
        events_section = month_header.find_next("ul") # Find the next <ul> that contains the events for this month

        # Loop through each <li> element inside the <ul> (this is where the events are listed)
        for li in events_section.find_all("li", recursive=False):
            date_tag = li.find("a") # The <a> tag inside <li> contains the date link
            if date_tag:
                date = date_tag.get_text() # Get the date 

                # Check if there are multiple events (nested <ul> inside the <li>)
                event_list = li.find("ul")
                if event_list:
                    for event_li in event_list.find_all("li"): # Multiple events for this day, loop through each event <li>
                        event_text = event_li.get_text()

                        data.append({
                            "year": year,
                            "month": month,
                            "date": date,
                            "event": event_text.strip()  # Strip the text to remove unnecessary white spaces
                        })
                else:  # For single event dates as there's no "ul"
                    event_text = ' '.join(li.stripped_strings) # Combine all text and links in the <li> tag
                    event_text = event_text.replace(f'{date} – ','')  # Remove the date part from the event text
                    if event_text:
                        data.append({
                            "year": year,
                            "month": month,
                            "date": date,
                            "event": event_text.strip() 
                        })

# Convert the data to DataFrame
df = pd.DataFrame(data)  

In [11]:
# Validate the data, focusing on January
df[df.month=='January']

Unnamed: 0,year,month,date,event
0,2020,January,January 1,Croatia begins its term in the presidency of the European Union.[6]
1,2020,January,January 1,"Flash floods struck Jakarta, Indonesia, killing 66 people in the worst flooding in over a decade.[7]"
2,2020,January,January 2,The Royal Australian Air Force and Navy are deployed to New South Wales and Victoria to assist mass evacuation efforts amidst the 2019–20 Australian bushfire season . [ 8 ] [ 9 ]
3,2020,January,January 3,"A United States drone strike at Baghdad International Airport kills ten people, including the intended target, an Iranian general. Qasem Soleimani and Iraqi paramilitary leader Abu Mahdi al-Muhandis . [ 10 ]"
4,2020,January,January 5,Second Libyan Civil War: President Recep Tayyip Erdoğan announces the deployment of Turkish troops to Libya on behalf of the United Nations-backed Government of National Accord.[11]
5,2020,January,January 5,"2019–20 Croatian presidential election: The second round of voting is held, and Zoran Milanović of the Social Democratic Party of Croatia defeats incumbent president Kolinda Grabar-Kitarović.[12]"
6,2020,January,January 8,"Iran launches ballistic missiles at two Iraqi military bases hosting U.S. soldiers, injuring over 100 personnel.[13]"
7,2020,January,January 8,"Ukraine International Airlines Flight 752 was shot down by Iranian forces shortly after takeoff from Tehran Imam Khomeini International Airport, killing all 176 people on board.[14]"
8,2020,January,January 9,"A rare, circumbinary planet called TOI 1338-b is discovered.[15]"
9,2020,January,January 9,"Islamic State millitants in the Greater Sahara assaulted a Nigerien military base in Chinagodrar, killing at least 89 Nigerien soldiers.[16]"


Based on my manual validation above, the output matches the source, so we can move on to the remaining years: 2021 through 2024, up to September 1, 2024.

In [13]:
# Create a function to fetch and parse event data for a given year from Wikipedia's world events
def extract_events_for_year(year, end_month=None, end_day=None, include_day=True):
    url = f"https://en.wikipedia.org/wiki/{year}" # Set up the url so the year input is flexible
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")

    # Define the list of months and limit if needed
    months = ["January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"]
    
    if end_month:  # Set up the month option for 2024 September 1st later
        end_month_idx = months.index(end_month) + 1
        months = months[:end_month_idx]

    data = []

    # Loop through each month and extract events
    for month in months:
        month_header = soup.find("h3", id=month) # Find the <h3> tag for the current month
        
        if month_header:
            events_section = month_header.find_next("ul") # Find the next <ul> that contains the events for this month
            
            # Loop through each <li> element inside the <ul> (this is where the events are listed)
            for li in events_section.find_all("li", recursive=False):  # recursive=False to avoid nested lists
                date_tag = li.find("a")  # The <a> tag inside <li> contains the date link
                if date_tag:
                    date = date_tag.get_text()  # Get the date 
                    
                    # Check if we need to stop on a specific day
                    if end_month and month == end_month and end_day and date == f"{end_month} {end_day}" and not include_day:
                        return data
                    
                    # Check if there are multiple events (nested <ul> inside the <li>)
                    event_list = li.find("ul")
                    if event_list:
                        for event_li in event_list.find_all("li"): # Multiple events for this day, loop through each event <li>
                            event_text = event_li.get_text() 
                           
                            data.append({
                                "year": year,
                                "month": month,
                                "date": date,
                                "event": event_text.strip()  
                            })
                    else: # For single event dates as there's no "ul"
                        event_text = ' '.join(li.stripped_strings)  # Combine all text and links in the <li> tag
                        event_text = event_text.replace(f'{date} – ', '')  # Remove the date part from the event text
                        if event_text:
                            data.append({
                                "year": year,
                                "month": month,
                                "date": date,
                                "event": event_text.strip()  
                            })

                    # Check if we've reached the exact end day
                    if end_month and month == end_month and end_day and date == f"{end_month} {end_day}" and include_day:
                        return data
    return data

In [14]:
# Collect data for 2021, 2022, 2023
all_data = []
for year in range(2021, 2024):
    year_data = extract_events_for_year(year)
    all_data.extend(year_data)

In [15]:
# Collect data for 2024 up to September 1, inclusive
year_2024_data = extract_events_for_year(2024, end_month="September", end_day=1, include_day=True)
all_data.extend(year_2024_data)

In [16]:
# Create the DataFrame and append it to the previous 2020 DataFrame
df = pd.concat([df, pd.DataFrame(all_data)], ignore_index=True)

In [17]:
# Data validation
print(df.head(10))
print("*" * 80)
print(df.tail(10))
print("*" * 80)
print(f'duplicated event fields: {df[df.event.duplicated()].shape[0]}')

   year    month       date   
0  2020  January  January 1  \
1  2020  January  January 1   
2  2020  January  January 2   
3  2020  January  January 3   
4  2020  January  January 5   
5  2020  January  January 5   
6  2020  January  January 8   
7  2020  January  January 8   
8  2020  January  January 9   
9  2020  January  January 9   

                                                                                                                                                                                                             event  
0                                                                                                                                              Croatia begins its term in the presidency of the European Union.[6]  
1                                                                                                             Flash floods struck Jakarta, Indonesia, killing 66 people in the worst flooding in over a decade.[7]  
2                  

The dataframe did not stop at September 1 because the page lists no world event on that specific date, so the exact-match check on the date label never fires and the function returns all of September 2024. This is fine because the data will be joined on date later.
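If a hard cutoff were needed at scraping time, one option is to compare parsed dates instead of matching the exact label. The sketch below assumes labels look like 'September 3'; `is_after_cutoff` is a hypothetical helper and is not used elsewhere in this notebook.

```python
from datetime import date, datetime

def is_after_cutoff(label, year, cutoff):
    """Return True when a label such as 'September 3' falls after `cutoff`.

    Labels that do not parse as '<Month> <day>' return False so the caller
    can decide how to handle them.
    """
    try:
        parsed = datetime.strptime(f"{year} {label}", "%Y %B %d").date()
    except ValueError:
        return False
    return parsed > cutoff

cutoff = date(2024, 9, 1)
for label in ["August 31", "September 1", "September 3", "Serious floods"]:
    print(label, "->", is_after_cutoff(label, 2024, cutoff))
```

Inside the loop over the `<li>` elements, a check like this could stop the scrape at the first label after the cutoff, even when the cutoff day itself has no entry.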

## Data Transformations

### Step 1: String field clean up
The event column now matches the source data from Wikipedia, but it contains some unnecessary characters, such as citation numbers in square brackets. Let's clean up the field by removing the noise characters and repeated spaces, and convert the text to lowercase as normalization for sentiment analysis.

In [21]:
# Define a function to clean up the 'event' text
def clean_event_text(text):
    text = re.sub(r'\[\s*\d+\s*\]','',text) # Remove citation numbers like [6], [ 8 ], etc.
    text = re.sub(r'\s+',' ', text).strip() # Remove extra spaces and trim the text
    text = text.lower()
    return text

In [22]:
# Apply the cleaning function to the 'event' column
df['event_cleaned'] = df.event.apply(clean_event_text)
df.head()

Unnamed: 0,year,month,date,event,event_cleaned
0,2020,January,January 1,Croatia begins its term in the presidency of the European Union.[6],croatia begins its term in the presidency of the european union.
1,2020,January,January 1,"Flash floods struck Jakarta, Indonesia, killing 66 people in the worst flooding in over a decade.[7]","flash floods struck jakarta, indonesia, killing 66 people in the worst flooding in over a decade."
2,2020,January,January 2,The Royal Australian Air Force and Navy are deployed to New South Wales and Victoria to assist mass evacuation efforts amidst the 2019–20 Australian bushfire season . [ 8 ] [ 9 ],the royal australian air force and navy are deployed to new south wales and victoria to assist mass evacuation efforts amidst the 2019–20 australian bushfire season .
3,2020,January,January 3,"A United States drone strike at Baghdad International Airport kills ten people, including the intended target, an Iranian general. Qasem Soleimani and Iraqi paramilitary leader Abu Mahdi al-Muhandis . [ 10 ]","a united states drone strike at baghdad international airport kills ten people, including the intended target, an iranian general. qasem soleimani and iraqi paramilitary leader abu mahdi al-muhandis ."
4,2020,January,January 5,Second Libyan Civil War: President Recep Tayyip Erdoğan announces the deployment of Turkish troops to Libya on behalf of the United Nations-backed Government of National Accord.[11],second libyan civil war: president recep tayyip erdoğan announces the deployment of turkish troops to libya on behalf of the united nations-backed government of national accord.
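Some cleaned rows still carry a space before punctuation (for example "bushfire season ."). A possible extension is sketched below, assuming that tightening the spacing before `. , ; : ! ?` is acceptable for the later analysis; `clean_event_text_tight` is a hypothetical variant, not the function applied above.

```python
import re

def clean_event_text_tight(text):
    """Variant of clean_event_text that also removes spaces before punctuation."""
    text = re.sub(r'\[\s*\d+\s*\]', '', text)     # drop citation markers such as [6] or [ 8 ]
    text = re.sub(r'\s+', ' ', text).strip()       # collapse runs of whitespace and trim
    text = re.sub(r'\s+([.,;:!?])', r'\1', text)   # "season ." -> "season."
    return text.lower()

sample = "The 2019–20 Australian bushfire season . [ 8 ] [ 9 ]"
print(clean_event_text_tight(sample))
```

The extra rule is deliberately narrow: it does not touch spaces inside quotation marks, so text such as `" tattoo "` would stay as it is.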


### Step 2: Date columns concatenation and type conversion
Since the date column does not contain the year and the year column does not contain the month and day, we need to concatenate them and convert the result to datetime so it can be joined with the other two data sources for analysis later.

In [24]:
# Check the data type of each column
df.dtypes

year              int64
month            object
date             object
event            object
event_cleaned    object
dtype: object

In [25]:
# Combine the year and date columns into a 'year month day' string, e.g. '2020 January 1'
df['event_date'] = df.year.astype(str) + ' ' + df.date
df.head()

Unnamed: 0,year,month,date,event,event_cleaned,event_date
0,2020,January,January 1,Croatia begins its term in the presidency of the European Union.[6],croatia begins its term in the presidency of the european union.,2020 January 1
1,2020,January,January 1,"Flash floods struck Jakarta, Indonesia, killing 66 people in the worst flooding in over a decade.[7]","flash floods struck jakarta, indonesia, killing 66 people in the worst flooding in over a decade.",2020 January 1
2,2020,January,January 2,The Royal Australian Air Force and Navy are deployed to New South Wales and Victoria to assist mass evacuation efforts amidst the 2019–20 Australian bushfire season . [ 8 ] [ 9 ],the royal australian air force and navy are deployed to new south wales and victoria to assist mass evacuation efforts amidst the 2019–20 australian bushfire season .,2020 January 2
3,2020,January,January 3,"A United States drone strike at Baghdad International Airport kills ten people, including the intended target, an Iranian general. Qasem Soleimani and Iraqi paramilitary leader Abu Mahdi al-Muhandis . [ 10 ]","a united states drone strike at baghdad international airport kills ten people, including the intended target, an iranian general. qasem soleimani and iraqi paramilitary leader abu mahdi al-muhandis .",2020 January 3
4,2020,January,January 5,Second Libyan Civil War: President Recep Tayyip Erdoğan announces the deployment of Turkish troops to Libya on behalf of the United Nations-backed Government of National Accord.[11],second libyan civil war: president recep tayyip erdoğan announces the deployment of turkish troops to libya on behalf of the united nations-backed government of national accord.,2020 January 5


In [26]:
# Convert the full date to datetime format
df.event_date = pd.to_datetime(df.event_date, format='%Y %B %d', errors='coerce')
df.head()

Unnamed: 0,year,month,date,event,event_cleaned,event_date
0,2020,January,January 1,Croatia begins its term in the presidency of the European Union.[6],croatia begins its term in the presidency of the european union.,2020-01-01
1,2020,January,January 1,"Flash floods struck Jakarta, Indonesia, killing 66 people in the worst flooding in over a decade.[7]","flash floods struck jakarta, indonesia, killing 66 people in the worst flooding in over a decade.",2020-01-01
2,2020,January,January 2,The Royal Australian Air Force and Navy are deployed to New South Wales and Victoria to assist mass evacuation efforts amidst the 2019–20 Australian bushfire season . [ 8 ] [ 9 ],the royal australian air force and navy are deployed to new south wales and victoria to assist mass evacuation efforts amidst the 2019–20 australian bushfire season .,2020-01-02
3,2020,January,January 3,"A United States drone strike at Baghdad International Airport kills ten people, including the intended target, an Iranian general. Qasem Soleimani and Iraqi paramilitary leader Abu Mahdi al-Muhandis . [ 10 ]","a united states drone strike at baghdad international airport kills ten people, including the intended target, an iranian general. qasem soleimani and iraqi paramilitary leader abu mahdi al-muhandis .",2020-01-03
4,2020,January,January 5,Second Libyan Civil War: President Recep Tayyip Erdoğan announces the deployment of Turkish troops to Libya on behalf of the United Nations-backed Government of National Accord.[11],second libyan civil war: president recep tayyip erdoğan announces the deployment of turkish troops to libya on behalf of the united nations-backed government of national accord.,2020-01-05


In [27]:
print('minimal value of the event_date column: ', df.event_date.min())
print('maximum value of the event_date column: ', df.event_date.max())
print('data type of the event_date column: ', df.event_date.dtypes)

minimal value of the event_date column:  2020-01-01 00:00:00
maximum value of the event_date column:  2024-09-30 00:00:00
data type of the event_date column:  datetime64[ns]
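To make the effect of `errors='coerce'` concrete, here is a small sketch on three hand-picked strings. The sample values are illustrative; only the first one follows the `'%Y %B %d'` layout.

```python
import pandas as pd

# One well-formed label and two that do not follow '<year> <Month> <day>'
labels = pd.Series(["2020 January 1", "2020 28 July", "2020 Serious floods"])

# Strings that do not match the format become NaT instead of raising an error
parsed = pd.to_datetime(labels, format="%Y %B %d", errors="coerce")

print(parsed)
print("unparsed rows:", parsed.isna().sum())
```

With `errors='coerce'` the conversion never fails loudly, so a null check on the resulting column is the only way to notice rows whose label was not a date.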


### Step 3: Handling null values and duplications
Let's check whether the dataframe contains any null values or duplicates.

In [29]:
# Check the number of null values in each column
df.isnull().sum()

year             0
month            0
date             0
event            0
event_cleaned    0
event_date       8
dtype: int64

As shown above, 8 records do not have a valid event_date value. We need to fix these records.

In [31]:
df[df.event_date.isna()]

Unnamed: 0,year,month,date,event,event_cleaned,event_date
156,2020,July,28 July,"Former Prime Minister of Malaysia Najib Razak is found guilty of all seven charges in the first of five trials on the 1MDB scandal , being jailed 12 years and fined RM 210 million as a result. [ 234 ]","former prime minister of malaysia najib razak is found guilty of all seven charges in the first of five trials on the 1mdb scandal , being jailed 12 years and fined rm 210 million as a result.",NaT
202,2020,October,Serious floods,"October 6 – Serious floods affected in Central Vietnam , lasted nearly 3 months and killed at least 249 people.","october 6 – serious floods affected in central vietnam , lasted nearly 3 months and killed at least 249 people.",NaT
468,2021,December,New York City FC,"December 11 – New York City FC defeat the Portland Timbers at Providence Park in Portland, Oregon 5–3 on penalties after a 1–1 draw, and win MLS Cup title for the first time in their history. [ 249 ]","december 11 – new york city fc defeat the portland timbers at providence park in portland, oregon 5–3 on penalties after a 1–1 draw, and win mls cup title for the first time in their history.",NaT
742,2023,May,13,"May 9– The Eurovision Song Contest 2023 is held in Liverpool , UK . [ 124 ] Swedish contestant Loreen wins with the song "" Tattoo "". [ 125 ]","may 9– the eurovision song contest 2023 is held in liverpool , uk . swedish contestant loreen wins with the song "" tattoo "".",NaT
783,2023,July,Bolivia,"July 20 – Bolivia and Iran sign a memorandum of understanding , in an upgrade of bilateral relations, expanding cooperation in the security and defense sectors. [ 172 ]","july 20 – bolivia and iran sign a memorandum of understanding , in an upgrade of bilateral relations, expanding cooperation in the security and defense sectors.",NaT
797,2023,August,2023 Canadian wildfires,2023 Canadian wildfires: 68% of the Northwest Territories are forced to evacuate to other parts of the country due to wildfires.[193],2023 canadian wildfires: 68% of the northwest territories are forced to evacuate to other parts of the country due to wildfires.,NaT
798,2023,August,2023 Canadian wildfires,Saudi Arabia is accused of mass killing hundreds of African migrants attempting to cross its border with Yemen.[194][195],saudi arabia is accused of mass killing hundreds of african migrants attempting to cross its border with yemen.,NaT
807,2023,September,2023 Marrakesh–Safi earthquake,"September 8 – 2023 Marrakesh–Safi earthquake : A 6.9 magnitude earthquake strikes Marrakesh–Safi province in western Morocco , killing at least 2,960 people and damaging historic buildings. [ 207 ]","september 8 – 2023 marrakesh–safi earthquake : a 6.9 magnitude earthquake strikes marrakesh–safi province in western morocco , killing at least 2,960 people and damaging historic buildings.",NaT


Since only 8 records failed the datetime conversion, I don't need to go back and revise the web scraping code; I can update them manually here.

In [33]:
# Manually add the values according to the information in the other fields
# Note: the two August 2023 records do not show a clear date, so I looked up the dates on the website manually
dates = ['2020-07-28', '2020-10-06','2021-12-11','2023-05-09','2023-07-20','2023-08-21','2023-08-21','2023-09-08']

In [55]:
# Manually add datetime values to the 'event_date' field for the given indices
df.loc[[156,202,468,742,783,797,798,807], 'event_date'] = pd.to_datetime(dates)
print('Null records after the update: \n', df.isnull().sum())
print('\nUpdated records:')
df.loc[[156,202,468,742,783,797,798,807]]

Null records after the update: 
 year             0
month            0
date             0
event            0
event_cleaned    0
event_date       0
dtype: int64

Updated records:


Unnamed: 0,year,month,date,event,event_cleaned,event_date
156,2020,July,28 July,"Former Prime Minister of Malaysia Najib Razak is found guilty of all seven charges in the first of five trials on the 1MDB scandal , being jailed 12 years and fined RM 210 million as a result. [ 234 ]","former prime minister of malaysia najib razak is found guilty of all seven charges in the first of five trials on the 1mdb scandal , being jailed 12 years and fined rm 210 million as a result.",2020-07-28
202,2020,October,Serious floods,"October 6 – Serious floods affected in Central Vietnam , lasted nearly 3 months and killed at least 249 people.","october 6 – serious floods affected in central vietnam , lasted nearly 3 months and killed at least 249 people.",2020-10-06
468,2021,December,New York City FC,"December 11 – New York City FC defeat the Portland Timbers at Providence Park in Portland, Oregon 5–3 on penalties after a 1–1 draw, and win MLS Cup title for the first time in their history. [ 249 ]","december 11 – new york city fc defeat the portland timbers at providence park in portland, oregon 5–3 on penalties after a 1–1 draw, and win mls cup title for the first time in their history.",2021-12-11
742,2023,May,13,"May 9– The Eurovision Song Contest 2023 is held in Liverpool , UK . [ 124 ] Swedish contestant Loreen wins with the song "" Tattoo "". [ 125 ]","may 9– the eurovision song contest 2023 is held in liverpool , uk . swedish contestant loreen wins with the song "" tattoo "".",2023-05-09
783,2023,July,Bolivia,"July 20 – Bolivia and Iran sign a memorandum of understanding , in an upgrade of bilateral relations, expanding cooperation in the security and defense sectors. [ 172 ]","july 20 – bolivia and iran sign a memorandum of understanding , in an upgrade of bilateral relations, expanding cooperation in the security and defense sectors.",2023-07-20
797,2023,August,2023 Canadian wildfires,2023 Canadian wildfires: 68% of the Northwest Territories are forced to evacuate to other parts of the country due to wildfires.[193],2023 canadian wildfires: 68% of the northwest territories are forced to evacuate to other parts of the country due to wildfires.,2023-08-21
798,2023,August,2023 Canadian wildfires,Saudi Arabia is accused of mass killing hundreds of African migrants attempting to cross its border with Yemen.[194][195],saudi arabia is accused of mass killing hundreds of african migrants attempting to cross its border with yemen.,2023-08-21
807,2023,September,2023 Marrakesh–Safi earthquake,"September 8 – 2023 Marrakesh–Safi earthquake : A 6.9 magnitude earthquake strikes Marrakesh–Safi province in western Morocco , killing at least 2,960 people and damaging historic buildings. [ 207 ]","september 8 – 2023 marrakesh–safi earthquake : a 6.9 magnitude earthquake strikes marrakesh–safi province in western morocco , killing at least 2,960 people and damaging historic buildings.",2023-09-08
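For a larger number of failures, hard-coding the dates would not scale. The sketch below shows one way to recover most of them from the fields already in the dataframe, assuming the failing labels are either day-first ('28 July') or the real date is a '<Month> <day> –' prefix of the event text. `recover_date` is a hypothetical helper; rows it cannot resolve would still need a manual lookup.

```python
import re
from datetime import datetime

# "<Month> <day>" at the start of the event text, followed by an en dash
PREFIX = re.compile(r'^\s*([A-Z][a-z]+ \d{1,2})\s*–')

def recover_date(year, label, event):
    """Try the label as '<day> <Month>', then a '<Month> <day> –' prefix in the event text."""
    try:
        return datetime.strptime(f"{year} {label}", "%Y %d %B").date()
    except ValueError:
        pass
    match = PREFIX.match(event)
    if match:
        try:
            return datetime.strptime(f"{year} {match.group(1)}", "%Y %B %d").date()
        except ValueError:
            return None
    return None  # no usable date found in either field

print(recover_date(2020, "28 July", "Former Prime Minister of Malaysia Najib Razak is found guilty"))
print(recover_date(2020, "Serious floods", "October 6 – Serious floods affected in Central Vietnam"))
print(recover_date(2023, "2023 Canadian wildfires", "2023 Canadian wildfires: 68% of the Northwest Territories"))
```

On the eight records above, this approach would leave only the two August 2023 rows unresolved, since neither their label nor their text contains a date.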


In [57]:
# Remove the leading date prefix (everything up to the first en dash) from the event_cleaned field, then trim the leftover space
df.loc[[202,468,742,783,807], 'event_cleaned'] = df.loc[[202,468,742,783,807], 'event_cleaned'].str.replace(r'^[^–\n]*\s*–','', regex=True).str.strip()
df.loc[[202,468,742,783,807]]

Unnamed: 0,year,month,date,event,event_cleaned,event_date
202,2020,October,Serious floods,"October 6 – Serious floods affected in Central Vietnam , lasted nearly 3 months and killed at least 249 people.","serious floods affected in central vietnam , lasted nearly 3 months and killed at least 249 people.",2020-10-06
468,2021,December,New York City FC,"December 11 – New York City FC defeat the Portland Timbers at Providence Park in Portland, Oregon 5–3 on penalties after a 1–1 draw, and win MLS Cup title for the first time in their history. [ 249 ]","new york city fc defeat the portland timbers at providence park in portland, oregon 5–3 on penalties after a 1–1 draw, and win mls cup title for the first time in their history.",2021-12-11
742,2023,May,13,"May 9– The Eurovision Song Contest 2023 is held in Liverpool , UK . [ 124 ] Swedish contestant Loreen wins with the song "" Tattoo "". [ 125 ]","the eurovision song contest 2023 is held in liverpool , uk . swedish contestant loreen wins with the song "" tattoo "".",2023-05-09
783,2023,July,Bolivia,"July 20 – Bolivia and Iran sign a memorandum of understanding , in an upgrade of bilateral relations, expanding cooperation in the security and defense sectors. [ 172 ]","bolivia and iran sign a memorandum of understanding , in an upgrade of bilateral relations, expanding cooperation in the security and defense sectors.",2023-07-20
807,2023,September,2023 Marrakesh–Safi earthquake,"September 8 – 2023 Marrakesh–Safi earthquake : A 6.9 magnitude earthquake strikes Marrakesh–Safi province in western Morocco , killing at least 2,960 people and damaging historic buildings. [ 207 ]","2023 marrakesh–safi earthquake : a 6.9 magnitude earthquake strikes marrakesh–safi province in western morocco , killing at least 2,960 people and damaging historic buildings.",2023-09-08


In [59]:
# Check the number of duplicated records based on the event_cleaned column
df['event_cleaned'].duplicated().sum()

0

### Step 4: Remove unnecessary fields
Now that we have the event_date field in datetime format, we can remove the first three columns, as well as the event column, since they are no longer needed for the project.

In [62]:
# Drop the 'year','month','date','event' columns
df.drop(columns=['year','month','date','event'], inplace=True)
cols=list(df.columns)
cols

['event_cleaned', 'event_date']

In [64]:
# Move the event_date column to the front
cols.insert(0, cols.pop(cols.index('event_date'))) #Pop the event_date column and insert it at index 0
cols

['event_date', 'event_cleaned']

In [66]:
# Reorder the dataframe
df = df[cols]
df.head()

Unnamed: 0,event_date,event_cleaned
0,2020-01-01,croatia begins its term in the presidency of the european union.
1,2020-01-01,"flash floods struck jakarta, indonesia, killing 66 people in the worst flooding in over a decade."
2,2020-01-02,the royal australian air force and navy are deployed to new south wales and victoria to assist mass evacuation efforts amidst the 2019–20 australian bushfire season .
3,2020-01-03,"a united states drone strike at baghdad international airport kills ten people, including the intended target, an iranian general. qasem soleimani and iraqi paramilitary leader abu mahdi al-muhandis ."
4,2020-01-05,second libyan civil war: president recep tayyip erdoğan announces the deployment of turkish troops to libya on behalf of the united nations-backed government of national accord.


### Step 5: Normalize the event_cleaned field
Now that we have gathered all world events at the daily level, event_date is duplicated whenever a day has multiple events. Because the Yahoo Finance data and the economic data have one row per date, joining them to the world events dataframe would inflate the numbers through the one-to-many relationship. Let's combine all the events within a day into a single row to avoid this later.

In [73]:
# Group by 'event_date' and aggregate 'event_cleaned' by concatenating the events
df_normalized = df.groupby('event_date', as_index=False).agg({'event_cleaned': ' '.join})
df_normalized.head(10)

Unnamed: 0,event_date,event_cleaned
0,2020-01-01,"croatia begins its term in the presidency of the european union. flash floods struck jakarta, indonesia, killing 66 people in the worst flooding in over a decade."
1,2020-01-02,the royal australian air force and navy are deployed to new south wales and victoria to assist mass evacuation efforts amidst the 2019–20 australian bushfire season .
2,2020-01-03,"a united states drone strike at baghdad international airport kills ten people, including the intended target, an iranian general. qasem soleimani and iraqi paramilitary leader abu mahdi al-muhandis ."
3,2020-01-05,"second libyan civil war: president recep tayyip erdoğan announces the deployment of turkish troops to libya on behalf of the united nations-backed government of national accord. 2019–20 croatian presidential election: the second round of voting is held, and zoran milanović of the social democratic party of croatia defeats incumbent president kolinda grabar-kitarović."
4,2020-01-08,"iran launches ballistic missiles at two iraqi military bases hosting u.s. soldiers, injuring over 100 personnel. ukraine international airlines flight 752 was shot down by iranian forces shortly after takeoff from tehran imam khomeini international airport, killing all 176 people on board."
5,2020-01-09,"a rare, circumbinary planet called toi 1338-b is discovered. islamic state millitants in the greater sahara assaulted a nigerien military base in chinagodrar, killing at least 89 nigerien soldiers."
6,2020-01-10,haitham bin tariq succeeds qaboos bin said as the sultan of oman .
7,2020-01-11,"presidential and legislative elections are held in taiwan . incumbent president tsai ing-wen is reelected, and the democratic progressive party wins a majority of 67 out of 113 seats in the legislative yuan ."
8,2020-01-12,the taal volcano in the philippines has had its first major eruption since 1977.
9,2020-01-16,"the first impeachment trial of the president of the united states , donald trump , begins in the u.s. senate . he was acquitted on february 5 ."


In [75]:
# Check the duplication for data validation
df_normalized.event_date.duplicated().sum()

0

In [77]:
print('Denormalized dimension of the world events dataframe: ',df.shape)
print('Normalized dimension of the world events dataframe: ',df_normalized.shape)

Denormalized dimension of the world events dataframe:  (1001, 2)
Normalized dimension of the world events dataframe:  (750, 2)
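The row inflation that this step avoids can be reproduced on toy data. This is a sketch only: `prices` is a made-up stand-in for the daily market or economic data, and the column names `date` and `close` are assumptions.

```python
import pandas as pd

# Hypothetical daily series standing in for the market or economic data
prices = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03"]),
    "close": [100.0, 101.5, 99.8],
})

# Two events on January 1, none on January 2, one on January 3
events_long = pd.DataFrame({
    "event_date": pd.to_datetime(["2020-01-01", "2020-01-01", "2020-01-03"]),
    "event_cleaned": ["event a", "event b", "event c"],
})

# One row per day, built the same way as df_normalized
events_daily = events_long.groupby("event_date", as_index=False).agg({"event_cleaned": " ".join})

# Joining the per-event table duplicates the January 1 price row
inflated = prices.merge(events_long, left_on="date", right_on="event_date", how="left")

# Joining the per-day table keeps one row per date; validate raises if keys repeat
joined = prices.merge(events_daily, left_on="date", right_on="event_date",
                      how="left", validate="one_to_one")

print(len(prices), len(inflated), len(joined))
```

Passing `validate="one_to_one"` to `merge` is a cheap guard: it raises an error if either side has duplicate keys, so a regression in the grouping step would be caught at join time.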


### Ethical Implications of the Data Wrangling

In this milestone, I scraped world event data from Wikipedia using the BeautifulSoup library. After extracting the data from Wikipedia and loading it into a DataFrame, I made several changes so it would be structured and clean for later analysis: standardizing event dates, removing citation numbers and extra white space, filling null values, checking for duplicates, lowercasing text fields, and consolidating the events for each day. The data transformations involved assumptions about how events were categorized, particularly in handling days with multiple events. These assumptions could introduce bias if not all events were captured equally, or if some events were prioritized over others due to their placement in the HTML source.

Wikipedia is written by the general public, but its content is not free of usage terms: the text is released under a Creative Commons Attribution-ShareAlike license, so it is important to respect Wikipedia's content policies, including attribution and adherence to the site's terms of service.

One ethical risk is that scraping content without verifying its accuracy could propagate misinformation or incomplete information. Although many consider Wikipedia reliable, it is not an authoritative source, and data derived from it should be cross-referenced with more credible datasets when possible, or at least spot-checked.

Additionally, citation numbers and formatting artifacts were removed, assuming they were irrelevant for the analysis. While this improves text processing, it may inadvertently strip context that could have been important for certain types of analyses, such as tracking sources of information.
The data was sourced ethically through public access to Wikipedia, but to mitigate potential issues, cross-verification with trusted sources or regular updates of the dataset should be considered. 

Finally, the cleaning steps should be documented transparently, and the data should be used ethically and responsibly to avoid misuse.