# <center>Data Wrangling</center>

<img src="../image/wrangling_DALLE.jpeg" width=30% align="right" style="in-line">

>*80% of data wrangling is spent cleaning the data.*
>
>*The other 20% is spent complaining about cleaning the data.*
>
>— ChatGPT 4o

<img src="../image/quote2_ChatGPT.png" width=60% align="left" style="in-line">

## Learning goals

1. Be able to perform web scraping across multiple pages.
2. Be able to load, explore, and understand the structure of datasets using Python.
3. Become familiar with common data wrangling tasks, including handling missing values, merging datasets, filtering, and sorting data.

## Agenda

1. [Review: web scraping](#1)
2. [Data wrangling](#2)

<a name="1"></a>
## Agenda 1. Web scraping

&#x1F4DD; **<font color=dodgerblue>FROM LAST WEEK: </font>** Here is an example project. We would like to find out what the European Union has been doing to advance sustainable mobility and transport. One possible data source is the news page (https://transport.ec.europa.eu/news-events/news_en) we just scraped, but we need more information than just the titles of the news.

So now please write some code to collect **the date, the title, the short description, the news type, and the link to the full text** of all news. Save the data to a **csv** file.

Here are some tips:
1. How many pages do you need to scrape? Observe how the web addresses change between the first page and the second.
2. Remember that we talked about **avoiding overloading servers** when we discussed ethics. Make sure to use `time.sleep()` between requests.
3. AI tools such as ChatGPT may help, but you need to make sure their solutions work.

If you would like to challenge yourself, see if you can scrape the full text (not just the short description) of the news. Trying this with one or two news items would be enough.

In [1]:
import requests
from bs4 import BeautifulSoup
import time
import csv
import pandas as pd

### Step 1: Send an HTTP request

In [2]:
url = "https://transport.ec.europa.eu/news-events/news_en?page=0"
page = requests.get(url)

In [3]:
print(page)

<Response [200]>


### Step 2: Parse the HTML content

In [4]:
soup = BeautifulSoup(page.content, 'html.parser')

In [5]:
print(soup)
#print(soup.prettify())

<!DOCTYPE html>

<html dir="ltr" lang="en" prefix="og: https://ogp.me/ns#">
<head>
<meta charset="utf-8"/>
<meta content="News" name="description"/>
<meta content="en" http-equiv="content-language"/>
<link href="https://transport.ec.europa.eu/news-events/news_en" rel="canonical"/>
<meta content="follow, noindex" name="robots"/>
<meta content="auto" property="og:determiner"/>
<meta content="Mobility and Transport" property="og:site_name"/>
<meta content="website" property="og:type"/>
<meta content="https://transport.ec.europa.eu/news-events/news_en" property="og:url"/>
<meta content="News" property="og:title"/>
<meta content="News" property="og:description"/>
<meta content="https://transport.ec.europa.eu/profiles/contrib/ewcms/modules/ewcms_seo/assets/images/ec-socialmedia-fallback.png" property="og:image"/>
<meta content="Mobility and Transport" property="og:image:alt"/>
<meta content="summary_large_image" name="twitter:card"/>
<meta content="News" name="twitter:title"/>
<meta content=

### Step 3: Locate the HTML elements containing the desired data

Chrome: `View` -> `Developer` -> `Inspect Elements`

<img src="../image/news_elements.png">

In [6]:
news = soup.find_all("article", class_="ecl-content-item")

In [7]:
print(news)

[<article class="ecl-content-item"><div class="ecl-content-block ecl-content-item__content-block" data-ecl-auto-init="ContentBlock" data-ecl-content-block=""><ul class="ecl-content-block__primary-meta-container"><li class="ecl-content-block__primary-meta-item">News article</li><li class="ecl-content-block__primary-meta-item"><time datetime="2024-10-17T12:00:00Z">17 October 2024</time></li></ul><div class="ecl-content-block__title" data-ecl-title-link=""><a class="ecl-link ecl-link--standalone" href="/news-events/news/solidarity-lanes-latest-figures-september-2024-2024-10-17_en">Solidarity Lanes: Latest figures – September 2024</a></div><div class="ecl-content-block__description"><p>Latest figures on Ukrainian exports and imports via the EU-Ukraine Solidarity Lanes: new transport routes established in the face of Russia’s war of aggression against Ukraine.</p></div><ul class="ecl-content-block__secondary-meta-container"><li class="ecl-content-block__secondary-meta-item"><svg aria-hidden

In [8]:
len(news)

10

### Step 4: Extract data

In [9]:
for item in news:
    date = item.time.attrs['datetime']
    title = item.find("a", class_="ecl-link ecl-link--standalone").get_text()
    desc = item.find("div", class_="ecl-content-block__description").get_text()
    news_type = item.find("li", class_="ecl-content-block__primary-meta-item").get_text()
    link = item.find("a", class_="ecl-link ecl-link--standalone", href=True)["href"]
    print(date,title,desc)
    print(news_type)
    print(link)
    print("=====")

2024-10-17T12:00:00Z Solidarity Lanes: Latest figures – September 2024 Latest figures on Ukrainian exports and imports via the EU-Ukraine Solidarity Lanes: new transport routes established in the face of Russia’s war of aggression against Ukraine.
News article
/news-events/news/solidarity-lanes-latest-figures-september-2024-2024-10-17_en
=====
2024-10-10T12:00:00Z 20,400 lives lost in EU road crashes last year In 2023, 20,400 people lost their lives in road crashes across the EU, marking a 1% decrease from the previous year, with 46 road deaths per million inhabitants.
News article
/news-events/news/20400-lives-lost-eu-road-crashes-last-year-2024-10-10_en
=====
2024-10-03T12:00:00Z October infringements package: key decisions In its regular package of infringement decisions, the European Commission pursues legal action against Member States for failing to comply with their obligations under EU law.
News article
https://ec.europa.eu/commission/presscorner/detail/en/inf_24_4561
=====
202
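
The links printed above come in two forms: site-relative paths (starting with `/news-events/...`) and absolute URLs pointing to other sites. To request the full text of an article, the relative ones need the site's address in front. Here is a minimal sketch using `urljoin` from the standard library, with two of the links shown above; the names `base` and `full_links` are only illustrative choices:

```python
from urllib.parse import urljoin

base = "https://transport.ec.europa.eu/news-events/news_en"  # the page we scraped

links = [
    "/news-events/news/20400-lives-lost-eu-road-crashes-last-year-2024-10-10_en",  # relative
    "https://ec.europa.eu/commission/presscorner/detail/en/inf_24_4561",  # absolute
]

# urljoin completes relative paths and leaves absolute URLs unchanged
full_links = [urljoin(base, link) for link in links]

for link in full_links:
    print(link)
```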

### Save data to csv

To scrape multiple pages, we need to observe how the web addresses change.

In [10]:
#page_number = range(7,18) # year 2023. your number could be different
page_number = range(0,8) # year 2024
stem = "https://transport.ec.europa.eu/news-events/news_en?page="

In [None]:
#csv_filename = "news2023.csv" # give the csv file a name
csv_filename = "news2024.csv" # matches page_number = range(0,8) above; switch both for year 2023

In [None]:
for i in page_number: # loop over all the pages
    url = stem + str(i)
    print(url)
    
    page = requests.get(url) # send the request
    soup = BeautifulSoup(page.content, 'html.parser') # parse the content
    news = soup.find_all("article", class_="ecl-content-item") # locate all the news
    
    for item in news: # loop over each article
        try: # use try except to skip error e.g., NoneType object - empty elements
            date = item.time.attrs['datetime']
        except (AttributeError, KeyError, TypeError) as e:
            date = ""
            print(f"Error extracting date: {e}")
            print(item)
            print("====")
        try:
            title = item.find("a", class_="ecl-link ecl-link--standalone").get_text()
        except AttributeError as e:
            title = ""
            print(f"Error extracting title: {e}")
            print(item)
            print("====")
        try:
            desc = item.find("div", class_="ecl-content-block__description").get_text()
        except AttributeError as e:
            desc = ""
            print(f"Error extracting description: {e}")
            print(item)
            print("====")
        try:
            news_type = item.find("li", class_="ecl-content-block__primary-meta-item").get_text()
        except AttributeError as e:
            news_type = ""
            print(f"Error extracting news type: {e}")
            print(item)
            print("====")
        try:
            link = item.find("a", class_="ecl-link ecl-link--standalone", href=True)["href"]
        except (AttributeError, TypeError) as e:
            link = ""
            print(f"Error extracting link: {e}")
            print(item)
            print("====")
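        # append one row per article; the with-block closes the file after each write
        with open(csv_filename, "a", encoding="latin1", newline="") as f:
            csv.writer(f).writerow([date, title, desc, news_type, link])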
        
    time.sleep(2)
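
The cell above reopens the csv file for every article. As a possible alternative, the file can be opened once in a `with` block. The sketch below is a toy under stated assumptions, not the method used in this notebook: it writes two rows taken from outputs shown in this notebook to a temporary file (`news_demo.csv` is a made-up name), uses `utf-8` instead of `latin1` so that characters such as `–` and `€` can be stored, and adds a header row, which the cell above does not.

```python
import csv
import os
import tempfile

# two example rows copied from outputs shown in this notebook
rows = [
    ["2024-10-17T12:00:00Z", "Solidarity Lanes: Latest figures – September 2024", "News article"],
    ["2023-03-14T12:00:00Z", "Italy: InvestEU - €3.4 billion to modernise the Palermo-Catania railway line", "News article"],
]

path = os.path.join(tempfile.mkdtemp(), "news_demo.csv")  # hypothetical file name

# open the file once; newline="" is what the csv module recommends
with open(path, "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["date", "title", "type"])  # header row
    writer.writerows(rows)

# read the file back with the same encoding to check the round trip
with open(path, encoding="utf-8", newline="") as f:
    read_back = list(csv.reader(f))

print(read_back[0])
print(read_back[1][1])
```

A file written this way already contains its column names, so it could be loaded with `pd.read_csv(path, encoding="utf-8")` without passing `names`.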

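Running the above cell for year 2023 printed 8 error messages: 4 articles, each failing on both the title and the link. That is okay. The `try`/`except` blocks store an empty string for the failed field and let the loop continue, and I have saved the error output in case we need to go back to it.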

<a name="error"></a>
The error output

```
https://transport.ec.europa.eu/news-events/news_en?page=7
https://transport.ec.europa.eu/news-events/news_en?page=8
Error extracting title: 'NoneType' object has no attribute 'get_text'
<article class="ecl-content-item"><div class="ecl-content-block ecl-content-item__content-block" data-ecl-auto-init="ContentBlock" data-ecl-content-block=""><ul class="ecl-content-block__primary-meta-container"><li class="ecl-content-block__primary-meta-item">News article</li><li class="ecl-content-block__primary-meta-item"><time datetime="2023-12-12T12:00:00Z">12 December 2023</time></li></ul><div class="ecl-content-block__title" data-ecl-title-link=""><a class="ecl-link ecl-link--standalone ecl-link--icon" href="https://www.wbif.eu/news-details/transport-ministers-discuss-reforms-within-western-balkans-transport-community-summit"><span class="ecl-link__label">Transport Ministers discuss reforms within Western Balkans at Transport Community Summit</span><svg aria-hidden="false" class="ecl-icon ecl-icon--2xs ecl-link__icon" focusable="false"><use xlink:href="/themes/contrib/oe_theme/dist/ec/images/icons/sprites/icons.svg#external"></use></svg></a></div><div class="ecl-content-block__description"><p>Transport ministers from the Western Balkans, Georgia, Moldova, and Ukraine, as well as the European Commission, gathered at the annual Ministerial Council Meeting of the Transport Community in Skopje, North Macedonia, on 12 December 2023.</p></div><div class="ecl-content-block__list-container"></div></div></article>
====
Error extracting link: 'NoneType' object is not subscriptable
<article class="ecl-content-item"><div class="ecl-content-block ecl-content-item__content-block" data-ecl-auto-init="ContentBlock" data-ecl-content-block=""><ul class="ecl-content-block__primary-meta-container"><li class="ecl-content-block__primary-meta-item">News article</li><li class="ecl-content-block__primary-meta-item"><time datetime="2023-12-12T12:00:00Z">12 December 2023</time></li></ul><div class="ecl-content-block__title" data-ecl-title-link=""><a class="ecl-link ecl-link--standalone ecl-link--icon" href="https://www.wbif.eu/news-details/transport-ministers-discuss-reforms-within-western-balkans-transport-community-summit"><span class="ecl-link__label">Transport Ministers discuss reforms within Western Balkans at Transport Community Summit</span><svg aria-hidden="false" class="ecl-icon ecl-icon--2xs ecl-link__icon" focusable="false"><use xlink:href="/themes/contrib/oe_theme/dist/ec/images/icons/sprites/icons.svg#external"></use></svg></a></div><div class="ecl-content-block__description"><p>Transport ministers from the Western Balkans, Georgia, Moldova, and Ukraine, as well as the European Commission, gathered at the annual Ministerial Council Meeting of the Transport Community in Skopje, North Macedonia, on 12 December 2023.</p></div><div class="ecl-content-block__list-container"></div></div></article>
====
https://transport.ec.europa.eu/news-events/news_en?page=9
Error extracting title: 'NoneType' object has no attribute 'get_text'
<article class="ecl-content-item"><div class="ecl-content-block ecl-content-item__content-block" data-ecl-auto-init="ContentBlock" data-ecl-content-block=""><ul class="ecl-content-block__primary-meta-container"><li class="ecl-content-block__primary-meta-item">News article</li><li class="ecl-content-block__primary-meta-item"><time datetime="2023-10-25T12:00:00Z">25 October 2023</time></li></ul><div class="ecl-content-block__title" data-ecl-title-link=""><a class="ecl-link ecl-link--standalone ecl-link--icon" href="https://www.ecac-ceac.org/news/960-press-release-european-aviation-leaders-unite-for-holistic-sustainability-in-the-ecac-eu-dialogue"><span class="ecl-link__label">European Commission joins the 12th edition of the ECAC/EU Dialogue</span><svg aria-hidden="false" class="ecl-icon ecl-icon--2xs ecl-link__icon" focusable="false"><use xlink:href="/themes/contrib/oe_theme/dist/ec/images/icons/sprites/icons.svg#external"></use></svg></a></div><div class="ecl-content-block__description"><p>European aviation leaders have come together in Valencia for the 12th edition of the ECAC/EU Dialogue, reaffirming their commitment to sustainable aviation.</p></div><div class="ecl-content-block__list-container"></div></div></article>
====
Error extracting link: 'NoneType' object is not subscriptable
<article class="ecl-content-item"><div class="ecl-content-block ecl-content-item__content-block" data-ecl-auto-init="ContentBlock" data-ecl-content-block=""><ul class="ecl-content-block__primary-meta-container"><li class="ecl-content-block__primary-meta-item">News article</li><li class="ecl-content-block__primary-meta-item"><time datetime="2023-10-25T12:00:00Z">25 October 2023</time></li></ul><div class="ecl-content-block__title" data-ecl-title-link=""><a class="ecl-link ecl-link--standalone ecl-link--icon" href="https://www.ecac-ceac.org/news/960-press-release-european-aviation-leaders-unite-for-holistic-sustainability-in-the-ecac-eu-dialogue"><span class="ecl-link__label">European Commission joins the 12th edition of the ECAC/EU Dialogue</span><svg aria-hidden="false" class="ecl-icon ecl-icon--2xs ecl-link__icon" focusable="false"><use xlink:href="/themes/contrib/oe_theme/dist/ec/images/icons/sprites/icons.svg#external"></use></svg></a></div><div class="ecl-content-block__description"><p>European aviation leaders have come together in Valencia for the 12th edition of the ECAC/EU Dialogue, reaffirming their commitment to sustainable aviation.</p></div><div class="ecl-content-block__list-container"></div></div></article>
====
https://transport.ec.europa.eu/news-events/news_en?page=10
https://transport.ec.europa.eu/news-events/news_en?page=11
https://transport.ec.europa.eu/news-events/news_en?page=12
https://transport.ec.europa.eu/news-events/news_en?page=13
https://transport.ec.europa.eu/news-events/news_en?page=14
https://transport.ec.europa.eu/news-events/news_en?page=15
Error extracting title: 'NoneType' object has no attribute 'get_text'
<article class="ecl-content-item"><div class="ecl-content-block ecl-content-item__content-block" data-ecl-auto-init="ContentBlock" data-ecl-content-block=""><ul class="ecl-content-block__primary-meta-container"><li class="ecl-content-block__primary-meta-item">News article</li><li class="ecl-content-block__primary-meta-item"><time datetime="2023-03-14T12:00:00Z">14 March 2023</time></li></ul><div class="ecl-content-block__title" data-ecl-title-link=""><a class="ecl-link ecl-link--standalone ecl-link--icon" href="https://www.eib.org/en/press/all/2023-108-investeu-eur3-4-billion-to-modernise-the-palermo-catania-railway-line"><span class="ecl-link__label">Italy: InvestEU - €3.4 billion to modernise the Palermo-Catania railway line</span><svg aria-hidden="false" class="ecl-icon ecl-icon--2xs ecl-link__icon" focusable="false"><use xlink:href="/themes/contrib/oe_theme/dist/ec/images/icons/sprites/icons.svg#external"></use></svg></a></div><div class="ecl-content-block__description"><p>The modernisation of 178 km of the Palermo-Catania line will reduce current travel times by a third, linking the two cities with a direct two-hour rail service, which will have a significant impact on economic, social, and sustainable development in Sicily.</p></div><div class="ecl-content-block__list-container"></div></div></article>
====
Error extracting link: 'NoneType' object is not subscriptable
<article class="ecl-content-item"><div class="ecl-content-block ecl-content-item__content-block" data-ecl-auto-init="ContentBlock" data-ecl-content-block=""><ul class="ecl-content-block__primary-meta-container"><li class="ecl-content-block__primary-meta-item">News article</li><li class="ecl-content-block__primary-meta-item"><time datetime="2023-03-14T12:00:00Z">14 March 2023</time></li></ul><div class="ecl-content-block__title" data-ecl-title-link=""><a class="ecl-link ecl-link--standalone ecl-link--icon" href="https://www.eib.org/en/press/all/2023-108-investeu-eur3-4-billion-to-modernise-the-palermo-catania-railway-line"><span class="ecl-link__label">Italy: InvestEU - €3.4 billion to modernise the Palermo-Catania railway line</span><svg aria-hidden="false" class="ecl-icon ecl-icon--2xs ecl-link__icon" focusable="false"><use xlink:href="/themes/contrib/oe_theme/dist/ec/images/icons/sprites/icons.svg#external"></use></svg></a></div><div class="ecl-content-block__description"><p>The modernisation of 178 km of the Palermo-Catania line will reduce current travel times by a third, linking the two cities with a direct two-hour rail service, which will have a significant impact on economic, social, and sustainable development in Sicily.</p></div><div class="ecl-content-block__list-container"></div></div></article>
====
Error extracting title: 'NoneType' object has no attribute 'get_text'
<article class="ecl-content-item"><div class="ecl-content-block ecl-content-item__content-block" data-ecl-auto-init="ContentBlock" data-ecl-content-block=""><ul class="ecl-content-block__primary-meta-container"><li class="ecl-content-block__primary-meta-item">News article</li><li class="ecl-content-block__primary-meta-item"><time datetime="2023-02-28T12:00:00Z">28 February 2023</time></li></ul><div class="ecl-content-block__title" data-ecl-title-link=""><a class="ecl-link ecl-link--standalone ecl-link--icon" href="https://mobilityweek.eu/media-corner/"><span class="ecl-link__label">Braga, Sofia and Zagreb among the finalists for European urban mobility awards</span><svg aria-hidden="false" class="ecl-icon ecl-icon--2xs ecl-link__icon" focusable="false"><use xlink:href="/themes/contrib/oe_theme/dist/ec/images/icons/sprites/icons.svg#external"></use></svg></a></div><div class="ecl-content-block__description"><p>The finalists for the EUROPEANMOBILITYWEEK award 2022 and the first-ever MOBILITYACTION award have been announced.</p></div><div class="ecl-content-block__list-container"></div></div></article>
====
Error extracting link: 'NoneType' object is not subscriptable
<article class="ecl-content-item"><div class="ecl-content-block ecl-content-item__content-block" data-ecl-auto-init="ContentBlock" data-ecl-content-block=""><ul class="ecl-content-block__primary-meta-container"><li class="ecl-content-block__primary-meta-item">News article</li><li class="ecl-content-block__primary-meta-item"><time datetime="2023-02-28T12:00:00Z">28 February 2023</time></li></ul><div class="ecl-content-block__title" data-ecl-title-link=""><a class="ecl-link ecl-link--standalone ecl-link--icon" href="https://mobilityweek.eu/media-corner/"><span class="ecl-link__label">Braga, Sofia and Zagreb among the finalists for European urban mobility awards</span><svg aria-hidden="false" class="ecl-icon ecl-icon--2xs ecl-link__icon" focusable="false"><use xlink:href="/themes/contrib/oe_theme/dist/ec/images/icons/sprites/icons.svg#external"></use></svg></a></div><div class="ecl-content-block__description"><p>The finalists for the EUROPEANMOBILITYWEEK award 2022 and the first-ever MOBILITYACTION award have been announced.</p></div><div class="ecl-content-block__list-container"></div></div></article>
====
https://transport.ec.europa.eu/news-events/news_en?page=16
https://transport.ec.europa.eu/news-events/news_en?page=17
```
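
In the output above, every failing `<a>` tag carries a third class, `ecl-link--icon`. Passing a string with several classes to `class_` asks Beautiful Soup for that exact class string, so `"ecl-link ecl-link--standalone"` no longer matches and `find()` returns `None`. The sketch below reproduces this on a hand-written HTML snippet (the two articles and their URLs are simplified stand-ins, not real scraped data) and shows one possible workaround: searching for a single class, which matches regardless of the other classes on the tag.

```python
from bs4 import BeautifulSoup

# simplified stand-in for two news items: one internal, one external link
html = """
<article class="ecl-content-item">
  <a class="ecl-link ecl-link--standalone" href="/news-events/news/internal_en">Internal item</a>
</article>
<article class="ecl-content-item">
  <a class="ecl-link ecl-link--standalone ecl-link--icon" href="https://example.org/external">
    <span class="ecl-link__label">External item</span>
  </a>
</article>
"""

soup = BeautifulSoup(html, "html.parser")
articles = soup.find_all("article", class_="ecl-content-item")

# exact class string: misses the tag that has the extra ecl-link--icon class
exact = [a.find("a", class_="ecl-link ecl-link--standalone") for a in articles]
print([tag is not None for tag in exact])

# single class: matches both tags
rows = []
for a in articles:
    link_tag = a.find("a", class_="ecl-link--standalone")
    rows.append((link_tag.get_text(strip=True), link_tag["href"]))
print(rows)
```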

<a name="2"></a>
## Agenda 2. Data wrangling

Data wrangling is the process of converting raw data into a usable form. It typically involves examining the data, handling missing values, cleaning, and transforming it.

### Examining data

We start by loading our datasets and inspecting them to get a better sense of their structure. 

In [62]:
news2023 = pd.read_csv("news2023.csv", encoding="latin1", names=["date", "title", "desc", "type", "link"]) # import csv with column names
news2024 = pd.read_csv("news2024.csv", encoding="latin1", names=["date", "title", "desc", "type", "link"])

In [63]:
news2023

Unnamed: 0,date,title,desc,type,link
0,2024-01-22T12:00:00Z,Save the date: Global TBO Symposium (4-6 June ...,Towards global trajectory-based operations (TBO),News article,https://transport.ec.europa.eu/news-events/eve...
1,2024-01-19T12:00:00Z,European Commission workshop addresses key cha...,"Decarbonisation, digitalisation, social securi...",News article,/news-events/news/european-commission-workshop...
2,2024-01-16T12:00:00Z,âEurope for Aviationâ teams up for Airspa...,"The âEurope for Aviationâ partners, consis...",News article,/news-events/news/europe-aviation-teams-airspa...
3,2023-12-21T12:00:00Z,December Infringements package: key decisions,December Infringements package: key decisions,News article,https://ec.europa.eu/commission/presscorner/de...
4,2023-12-19T12:00:00Z,Provisional agreement on more sustainable and ...,TEN-T Trilogue,News article,/news-events/news/provisional-agreement-more-s...
...,...,...,...,...,...
95,2022-12-19T12:00:00Z,Future Mobility: â¬40 million EIB loan for Ca...,The EIB is providing the Spanish multi-mobilit...,News article,/news-events/news/future-mobility-eu40-million...
96,2022-12-16T12:00:00Z,New shipping fuel standards to reduce sulphur ...,The Commission welcomes the agreement reached ...,News article,/news-events/news/new-shipping-fuel-standards-...
97,2022-12-15T12:00:00Z,Vacancy for one post of Seconded National Expe...,The deadline for applications is 31/01/2023 at...,News article,https://rail-research.europa.eu/about-europes-...
98,2022-12-09T12:00:00Z,EU-Ukraine Solidarity Lanes: Commission and EB...,The European Commission and the European Bank ...,News article,/news-events/news/eu-ukraine-solidarity-lanes-...


In [17]:
news2024.head()

Unnamed: 0,date,title,desc,type,link
0,2024-10-02T12:00:00Z,Single European Sky: annual monitoring highlig...,The Performance Review Body (PRB) has publishe...,News article,/news-events/news/single-european-sky-annual-m...
1,2024-09-25T12:00:00Z,Commission seeks feedback on the Flight Emissi...,The Commission has launched a public consultat...,News article,/news-events/news/commission-seeks-feedback-fl...
2,2024-09-25T12:00:00Z,Women in Rail 2024 Award to boost female talen...,The Commission together with Europe’s Rail Joi...,News article,/news-events/news/women-rail-2024-award-boost-...
3,2024-09-24T12:00:00Z,CEF Transport: €2.5 billion to boost resilienc...,The 2024 CEF for Transport call for proposals ...,News article,/news-events/news/cef-transport-eu25-billion-b...
4,2024-09-23T12:00:00Z,Rail market opening: competition leads to lowe...,A study released by the European Commission hi...,News article,/news-events/news/rail-market-opening-competit...


In [18]:
len(news2024) # check the number of rows

80

In [19]:
news2024.shape # check the number of rows and columns

(80, 5)

In [20]:
news2024.info() # summarize by the columns

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 80 entries, 0 to 79
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   date    80 non-null     object
 1   title   80 non-null     object
 2   desc    80 non-null     object
 3   type    80 non-null     object
 4   link    80 non-null     object
dtypes: object(5)
memory usage: 3.3+ KB


In [21]:
news2024.describe() # summary statistics; most useful for numerical data (text columns show count, unique, top, freq)

Unnamed: 0,date,title,desc,type,link
count,80,80,80,80,80
unique,64,80,75,2,80
top,2024-04-10T12:00:00Z,Single European Sky: annual monitoring highlig...,Latest figures on Ukrainian exports and import...,News article,/news-events/news/single-european-sky-annual-m...
freq,2,1,5,79,1


In [23]:
type(news2024.date.iloc[0]) # check the data type

str
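
Since the dates are stored as strings, date operations such as extracting the month or sorting chronologically are not directly available. Below is a minimal sketch of converting them with `pd.to_datetime`, using two date strings from the table above in a toy DataFrame (`toy` and the column name `date_parsed` are illustrative choices, not part of the dataset):

```python
import pandas as pd

toy = pd.DataFrame({"date": ["2024-10-02T12:00:00Z", "2024-09-25T12:00:00Z"]})

# parse the ISO 8601 strings; utc=True keeps the "Z" (UTC) information
toy["date_parsed"] = pd.to_datetime(toy["date"], utc=True)

print(type(toy["date_parsed"].iloc[0]))  # a pandas Timestamp instead of str
print(toy["date_parsed"].dt.month.tolist())
```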

In [24]:
news2023.type.unique() # check the unique values of a column

array(['News article', 'Supplementary information', 'Speech'],
      dtype=object)

In [25]:
news2023.type.value_counts() # check each value's number of occurrences

type
News article                 98
Supplementary information     1
Speech                        1
Name: count, dtype: int64

### Handling missing values

In [26]:
news2024.isnull().sum() # identify missing data

date     0
title    0
desc     0
type     0
link     0
dtype: int64

In [27]:
news2023.isnull().sum()

date     0
title    4
desc     0
type     0
link     4
dtype: int64

In [28]:
news2023[news2023.title.isnull()] # pinpoint which rows have missing values

Unnamed: 0,date,title,desc,type,link
10,2023-12-12T12:00:00Z,,"Transport ministers from the Western Balkans, ...",News article,
21,2023-10-25T12:00:00Z,,European aviation leaders have come together i...,News article,
73,2023-03-14T12:00:00Z,,The modernisation of 178 km of the Palermo-Cat...,News article,
77,2023-02-28T12:00:00Z,,The finalists for the EUROPEANMOBILITYWEEK awa...,News article,


In [29]:
news2023[news2023.link.isnull()] # check again by "link"

Unnamed: 0,date,title,desc,type,link
10,2023-12-12T12:00:00Z,,"Transport ministers from the Western Balkans, ...",News article,
21,2023-10-25T12:00:00Z,,European aviation leaders have come together i...,News article,
73,2023-03-14T12:00:00Z,,The modernisation of 178 km of the Palermo-Cat...,News article,
77,2023-02-28T12:00:00Z,,The finalists for the EUROPEANMOBILITYWEEK awa...,News article,


Missing values can skew results or raise errors during analysis. Here are some common strategies for handling them:

- Manually review and fix the missing data (if possible)
- Remove missing data e.g., drop rows
- Impute missing data e.g., replace with the mean (for numerical data)
- Fill with a specific value e.g., replace with "unknown" or "NA"
- Predictive imputation e.g., use machine learning models to predict missing values based on other features
- Leave as missing
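
Three of the strategies above (dropping rows, imputing with the mean, and filling with a specific value) can be sketched on a toy DataFrame. The data and the `views` column are hypothetical and only serve to have one text and one numerical column:

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["A", None, "C"],
    "views": [10.0, None, 30.0],  # hypothetical numerical column
})

dropped = toy.dropna()  # remove rows with any missing value
imputed = toy["views"].fillna(toy["views"].mean())  # replace with the column mean
labelled = toy["title"].fillna("unknown")  # fill with a specific value

print(len(dropped))
print(imputed.tolist())
print(labelled.tolist())
```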

For our example, we choose the first option because we have access to the original data source. Let's go back to the previous error messages.
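
Once the title and link have been read off the error output, a manual fix could look like the sketch below. It rebuilds only row 10 (the 12 December 2023 article) in a small stand-in DataFrame called `demo`, an illustrative name, and fills the two gaps by row label with `.loc`; the same pattern would apply to the real `news2023`.

```python
import pandas as pd

# stand-in for row 10 of news2023, with title and link missing
demo = pd.DataFrame(
    {"date": ["2023-12-12T12:00:00Z"], "title": [None], "link": [None]},
    index=[10],
)

# values copied from the saved error output
demo.loc[10, "title"] = "Transport Ministers discuss reforms within Western Balkans at Transport Community Summit"
demo.loc[10, "link"] = "https://www.wbif.eu/news-details/transport-ministers-discuss-reforms-within-western-balkans-transport-community-summit"

print(demo.isnull().sum())  # check that no missing values remain
```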

&#x1F4A1; **Markdown**: Markdown is a markup language that can be used as a text-to-HTML conversion tool. Read more on Jupyter [Markdown cells](https://jupyter-notebook.readthedocs.io/en/stable/examples/Notebook/Working%20With%20Markdown%20Cells.html). Because a Markdown cell renders HTML, it can help us read the error messages and fix the missing values. Below is a direct copy and paste of the [error messages](#error) (in a Markdown cell).

https://transport.ec.europa.eu/news-events/news_en?page=7
https://transport.ec.europa.eu/news-events/news_en?page=8
Error extracting title: 'NoneType' object has no attribute 'get_text'
<article class="ecl-content-item"><div class="ecl-content-block ecl-content-item__content-block" data-ecl-auto-init="ContentBlock" data-ecl-content-block=""><ul class="ecl-content-block__primary-meta-container"><li class="ecl-content-block__primary-meta-item">News article</li><li class="ecl-content-block__primary-meta-item"><time datetime="2023-12-12T12:00:00Z">12 December 2023</time></li></ul><div class="ecl-content-block__title" data-ecl-title-link=""><a class="ecl-link ecl-link--standalone ecl-link--icon" href="https://www.wbif.eu/news-details/transport-ministers-discuss-reforms-within-western-balkans-transport-community-summit"><span class="ecl-link__label">Transport Ministers discuss reforms within Western Balkans at Transport Community Summit</span><svg aria-hidden="false" class="ecl-icon ecl-icon--2xs ecl-link__icon" focusable="false"><use xlink:href="/themes/contrib/oe_theme/dist/ec/images/icons/sprites/icons.svg#external"></use></svg></a></div><div class="ecl-content-block__description"><p>Transport ministers from the Western Balkans, Georgia, Moldova, and Ukraine, as well as the European Commission, gathered at the annual Ministerial Council Meeting of the Transport Community in Skopje, North Macedonia, on 12 December 2023.</p></div><div class="ecl-content-block__list-container"></div></div></article>
====
Error extracting link: 'NoneType' object is not subscriptable
<article class="ecl-content-item"><div class="ecl-content-block ecl-content-item__content-block" data-ecl-auto-init="ContentBlock" data-ecl-content-block=""><ul class="ecl-content-block__primary-meta-container"><li class="ecl-content-block__primary-meta-item">News article</li><li class="ecl-content-block__primary-meta-item"><time datetime="2023-12-12T12:00:00Z">12 December 2023</time></li></ul><div class="ecl-content-block__title" data-ecl-title-link=""><a class="ecl-link ecl-link--standalone ecl-link--icon" href="https://www.wbif.eu/news-details/transport-ministers-discuss-reforms-within-western-balkans-transport-community-summit"><span class="ecl-link__label">Transport Ministers discuss reforms within Western Balkans at Transport Community Summit</span><svg aria-hidden="false" class="ecl-icon ecl-icon--2xs ecl-link__icon" focusable="false"><use xlink:href="/themes/contrib/oe_theme/dist/ec/images/icons/sprites/icons.svg#external"></use></svg></a></div><div class="ecl-content-block__description"><p>Transport ministers from the Western Balkans, Georgia, Moldova, and Ukraine, as well as the European Commission, gathered at the annual Ministerial Council Meeting of the Transport Community in Skopje, North Macedonia, on 12 December 2023.</p></div><div class="ecl-content-block__list-container"></div></div></article>
====
https://transport.ec.europa.eu/news-events/news_en?page=9
Error extracting title: 'NoneType' object has no attribute 'get_text'
<article class="ecl-content-item"><div class="ecl-content-block ecl-content-item__content-block" data-ecl-auto-init="ContentBlock" data-ecl-content-block=""><ul class="ecl-content-block__primary-meta-container"><li class="ecl-content-block__primary-meta-item">News article</li><li class="ecl-content-block__primary-meta-item"><time datetime="2023-10-25T12:00:00Z">25 October 2023</time></li></ul><div class="ecl-content-block__title" data-ecl-title-link=""><a class="ecl-link ecl-link--standalone ecl-link--icon" href="https://www.ecac-ceac.org/news/960-press-release-european-aviation-leaders-unite-for-holistic-sustainability-in-the-ecac-eu-dialogue"><span class="ecl-link__label">European Commission joins the 12th edition of the ECAC/EU Dialogue</span><svg aria-hidden="false" class="ecl-icon ecl-icon--2xs ecl-link__icon" focusable="false"><use xlink:href="/themes/contrib/oe_theme/dist/ec/images/icons/sprites/icons.svg#external"></use></svg></a></div><div class="ecl-content-block__description"><p>European aviation leaders have come together in Valencia for the 12th edition of the ECAC/EU Dialogue, reaffirming their commitment to sustainable aviation.</p></div><div class="ecl-content-block__list-container"></div></div></article>
====
Error extracting link: 'NoneType' object is not subscriptable
<article class="ecl-content-item"><div class="ecl-content-block ecl-content-item__content-block" data-ecl-auto-init="ContentBlock" data-ecl-content-block=""><ul class="ecl-content-block__primary-meta-container"><li class="ecl-content-block__primary-meta-item">News article</li><li class="ecl-content-block__primary-meta-item"><time datetime="2023-10-25T12:00:00Z">25 October 2023</time></li></ul><div class="ecl-content-block__title" data-ecl-title-link=""><a class="ecl-link ecl-link--standalone ecl-link--icon" href="https://www.ecac-ceac.org/news/960-press-release-european-aviation-leaders-unite-for-holistic-sustainability-in-the-ecac-eu-dialogue"><span class="ecl-link__label">European Commission joins the 12th edition of the ECAC/EU Dialogue</span><svg aria-hidden="false" class="ecl-icon ecl-icon--2xs ecl-link__icon" focusable="false"><use xlink:href="/themes/contrib/oe_theme/dist/ec/images/icons/sprites/icons.svg#external"></use></svg></a></div><div class="ecl-content-block__description"><p>European aviation leaders have come together in Valencia for the 12th edition of the ECAC/EU Dialogue, reaffirming their commitment to sustainable aviation.</p></div><div class="ecl-content-block__list-container"></div></div></article>
====
https://transport.ec.europa.eu/news-events/news_en?page=10
https://transport.ec.europa.eu/news-events/news_en?page=11
https://transport.ec.europa.eu/news-events/news_en?page=12
https://transport.ec.europa.eu/news-events/news_en?page=13
https://transport.ec.europa.eu/news-events/news_en?page=14
https://transport.ec.europa.eu/news-events/news_en?page=15
Error extracting title: 'NoneType' object has no attribute 'get_text'
<article class="ecl-content-item"><div class="ecl-content-block ecl-content-item__content-block" data-ecl-auto-init="ContentBlock" data-ecl-content-block=""><ul class="ecl-content-block__primary-meta-container"><li class="ecl-content-block__primary-meta-item">News article</li><li class="ecl-content-block__primary-meta-item"><time datetime="2023-03-14T12:00:00Z">14 March 2023</time></li></ul><div class="ecl-content-block__title" data-ecl-title-link=""><a class="ecl-link ecl-link--standalone ecl-link--icon" href="https://www.eib.org/en/press/all/2023-108-investeu-eur3-4-billion-to-modernise-the-palermo-catania-railway-line"><span class="ecl-link__label">Italy: InvestEU - €3.4 billion to modernise the Palermo-Catania railway line</span><svg aria-hidden="false" class="ecl-icon ecl-icon--2xs ecl-link__icon" focusable="false"><use xlink:href="/themes/contrib/oe_theme/dist/ec/images/icons/sprites/icons.svg#external"></use></svg></a></div><div class="ecl-content-block__description"><p>The modernisation of 178 km of the Palermo-Catania line will reduce current travel times by a third, linking the two cities with a direct two-hour rail service, which will have a significant impact on economic, social, and sustainable development in Sicily.</p></div><div class="ecl-content-block__list-container"></div></div></article>
====
Error extracting link: 'NoneType' object is not subscriptable
<article class="ecl-content-item"><div class="ecl-content-block ecl-content-item__content-block" data-ecl-auto-init="ContentBlock" data-ecl-content-block=""><ul class="ecl-content-block__primary-meta-container"><li class="ecl-content-block__primary-meta-item">News article</li><li class="ecl-content-block__primary-meta-item"><time datetime="2023-03-14T12:00:00Z">14 March 2023</time></li></ul><div class="ecl-content-block__title" data-ecl-title-link=""><a class="ecl-link ecl-link--standalone ecl-link--icon" href="https://www.eib.org/en/press/all/2023-108-investeu-eur3-4-billion-to-modernise-the-palermo-catania-railway-line"><span class="ecl-link__label">Italy: InvestEU - €3.4 billion to modernise the Palermo-Catania railway line</span><svg aria-hidden="false" class="ecl-icon ecl-icon--2xs ecl-link__icon" focusable="false"><use xlink:href="/themes/contrib/oe_theme/dist/ec/images/icons/sprites/icons.svg#external"></use></svg></a></div><div class="ecl-content-block__description"><p>The modernisation of 178 km of the Palermo-Catania line will reduce current travel times by a third, linking the two cities with a direct two-hour rail service, which will have a significant impact on economic, social, and sustainable development in Sicily.</p></div><div class="ecl-content-block__list-container"></div></div></article>
====
Error extracting title: 'NoneType' object has no attribute 'get_text'
<article class="ecl-content-item"><div class="ecl-content-block ecl-content-item__content-block" data-ecl-auto-init="ContentBlock" data-ecl-content-block=""><ul class="ecl-content-block__primary-meta-container"><li class="ecl-content-block__primary-meta-item">News article</li><li class="ecl-content-block__primary-meta-item"><time datetime="2023-02-28T12:00:00Z">28 February 2023</time></li></ul><div class="ecl-content-block__title" data-ecl-title-link=""><a class="ecl-link ecl-link--standalone ecl-link--icon" href="https://mobilityweek.eu/media-corner/"><span class="ecl-link__label">Braga, Sofia and Zagreb among the finalists for European urban mobility awards</span><svg aria-hidden="false" class="ecl-icon ecl-icon--2xs ecl-link__icon" focusable="false"><use xlink:href="/themes/contrib/oe_theme/dist/ec/images/icons/sprites/icons.svg#external"></use></svg></a></div><div class="ecl-content-block__description"><p>The finalists for the EUROPEANMOBILITYWEEK award 2022 and the first-ever MOBILITYACTION award have been announced.</p></div><div class="ecl-content-block__list-container"></div></div></article>
====
Error extracting link: 'NoneType' object is not subscriptable
<article class="ecl-content-item"><div class="ecl-content-block ecl-content-item__content-block" data-ecl-auto-init="ContentBlock" data-ecl-content-block=""><ul class="ecl-content-block__primary-meta-container"><li class="ecl-content-block__primary-meta-item">News article</li><li class="ecl-content-block__primary-meta-item"><time datetime="2023-02-28T12:00:00Z">28 February 2023</time></li></ul><div class="ecl-content-block__title" data-ecl-title-link=""><a class="ecl-link ecl-link--standalone ecl-link--icon" href="https://mobilityweek.eu/media-corner/"><span class="ecl-link__label">Braga, Sofia and Zagreb among the finalists for European urban mobility awards</span><svg aria-hidden="false" class="ecl-icon ecl-icon--2xs ecl-link__icon" focusable="false"><use xlink:href="/themes/contrib/oe_theme/dist/ec/images/icons/sprites/icons.svg#external"></use></svg></a></div><div class="ecl-content-block__description"><p>The finalists for the EUROPEANMOBILITYWEEK award 2022 and the first-ever MOBILITYACTION award have been announced.</p></div><div class="ecl-content-block__list-container"></div></div></article>
====
https://transport.ec.europa.eu/news-events/news_en?page=16
https://transport.ec.europa.eu/news-events/news_en?page=17
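
All the items that failed in the log above link to external sites, and their markup wraps the title in a `<span class="ecl-link__label">` inside the `<a>` tag. The scraping code itself is not shown here, so the following is only a minimal sketch of one defensive way to read the title and link. The CSS selector and the helper name `extract_title_and_link` are assumptions based on the HTML printed in the log, not the notebook's original code.

```python
from bs4 import BeautifulSoup

# Trimmed copy of one failing item from the log (external-link markup).
html = """
<article class="ecl-content-item">
  <div class="ecl-content-block__title" data-ecl-title-link="">
    <a class="ecl-link ecl-link--standalone ecl-link--icon" href="https://mobilityweek.eu/media-corner/">
      <span class="ecl-link__label">Braga, Sofia and Zagreb among the finalists for European urban mobility awards</span>
    </a>
  </div>
</article>
"""


def extract_title_and_link(article):
    """Hypothetical helper: return (title, link), or (None, None) if no anchor is found."""
    # Assumed selector: any <a> inside the title block, whatever its extra classes.
    anchor = article.select_one("div.ecl-content-block__title a")
    if anchor is None:
        return None, None
    # get_text() collects the text of nested tags such as the <span> label.
    return anchor.get_text(strip=True), anchor.get("href")


article = BeautifulSoup(html, "html.parser").find("article")
title, link = extract_title_and_link(article)
print(title)
print(link)
```

Returning `None` instead of raising keeps the loop running, and the gaps can then be found later with `isnull()`, as done in the cells below.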

In [30]:
news2023.title.iloc[10]  # the first missing title

nan

In [31]:
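# write the title to the cell; a single .loc[row, column] call updates the DataFrame itself
# (chained assignment such as news2023.iloc[10].title = ... may only modify a temporary copy)
news2023.loc[10, "title"] = "Transport Ministers discuss reforms within Western Balkans at Transport Community Summit"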

In [32]:
news2023[news2023.title.isnull()] # check the missing titles again; row 10 is no longer listed

Unnamed: 0,date,title,desc,type,link
21,2023-10-25T12:00:00Z,,European aviation leaders have come together i...,News article,
73,2023-03-14T12:00:00Z,,The modernisation of 178 km of the Palermo-Cat...,News article,
77,2023-02-28T12:00:00Z,,The finalists for the EUROPEANMOBILITYWEEK awa...,News article,


In [33]:
# do the same for the missing link
news2023.loc[10, "link"] = "https://www.wbif.eu/news-details/transport-ministers-discuss-reforms-within-western-balkans-transport-community-summit"

In [34]:
# do the same for the remaining missing values
news2023.loc[21, "title"] = "European Commission joins the 12th edition of the ECAC/EU Dialogue"
news2023.loc[21, "link"] = "https://www.ecac-ceac.org/news/960-press-release-european-aviation-leaders-unite-for-holistic-sustainability-in-the-ecac-eu-dialogue"
news2023.loc[73, "title"] = "Italy: InvestEU - €3.4 billion to modernise the Palermo-Catania railway line"
news2023.loc[73, "link"] = "https://www.eib.org/en/press/all/2023-108-investeu-eur3-4-billion-to-modernise-the-palermo-catania-railway-line"
news2023.loc[77, "title"] = "Braga, Sofia and Zagreb among the finalists for European urban mobility awards"
news2023.loc[77, "link"] = "https://mobilityweek.eu/media-corner/"

In [35]:
news2023.isnull().sum() # check again. no more missing values

date     0
title    0
desc     0
type     0
link     0
dtype: int64
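
When there are more than a handful of cells to repair, the manual fixes can be kept in one place and applied in a loop. The sketch below uses a toy DataFrame and made-up corrections (the `fixes` dictionary is hypothetical) to show the pattern. It assumes, as in `news2023`, that the row labels are the default 0, 1, 2, ... index.

```python
import pandas as pd

# Toy stand-in for a scraped table with two incomplete rows.
df = pd.DataFrame({
    "title": ["A", None, "C", None],
    "link": ["/a", None, "/c", None],
})

# Hypothetical corrections, keyed by row label and then by column name.
fixes = {
    1: {"title": "B", "link": "/b"},
    3: {"title": "D", "link": "/d"},
}

for row_label, values in fixes.items():
    for column, value in values.items():
        # One indexing step (row and column together) writes to df itself.
        df.loc[row_label, column] = value

print(df.isnull().sum())
```

Keeping the corrections in a dictionary also documents which values were entered by hand rather than scraped.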

### Merging datasets

There are many ways to combine datasets, such as stacking rows or joining on a shared key column. Check this [documentation](https://pandas.pydata.org/docs/user_guide/merging.html) or ask ChatGPT. Our two datasets have the same columns, so let's stack them into one with `pd.concat`.
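
To make the distinction concrete, here is a small sketch on toy data (all frames and column names are invented for illustration). `pd.concat` stacks tables that share the same columns, while `DataFrame.merge` joins tables side by side on a key column.

```python
import pandas as pd

a = pd.DataFrame({"id": [1, 2], "title": ["x", "y"]})
b = pd.DataFrame({"id": [2, 3], "title": ["y", "z"]})

# Stack the rows of b under a; ignore_index=True renumbers the rows 0..3.
stacked = pd.concat([a, b], ignore_index=True)

# Join extra information onto each row by matching the "id" column.
views = pd.DataFrame({"id": [1, 2, 3], "views": [10, 20, 30]})
joined = stacked.merge(views, on="id", how="left")

print(stacked)
print(joined)
```

Note that `stacked` contains the row with `id` 2 twice, because `pd.concat` does not check for overlap between its inputs.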

In [36]:
news = pd.concat([news2023, news2024], ignore_index=True) # concatenate and re-index

In [37]:
news

Unnamed: 0,date,title,desc,type,link
0,2024-01-22T12:00:00Z,Save the date: Global TBO Symposium (4-6 June ...,Towards global trajectory-based operations (TBO),News article,https://transport.ec.europa.eu/news-events/eve...
1,2024-01-19T12:00:00Z,European Commission workshop addresses key cha...,"Decarbonisation, digitalisation, social securi...",News article,/news-events/news/european-commission-workshop...
2,2024-01-16T12:00:00Z,“Europe for Aviation” teams up for Airspace W...,"The “Europe for Aviation” partners, consisting...",News article,/news-events/news/europe-aviation-teams-airspa...
3,2023-12-21T12:00:00Z,December Infringements package: key decisions,December Infringements package: key decisions,News article,https://ec.europa.eu/commission/presscorner/de...
4,2023-12-19T12:00:00Z,Provisional agreement on more sustainable and ...,TEN-T Trilogue,News article,/news-events/news/provisional-agreement-more-s...
...,...,...,...,...,...
175,2023-12-18T12:00:00Z,Joint Statement on Higher Airspace Operations ...,Joint Statement on Higher Airspace Operations ...,News article,/news-events/news/joint-statement-higher-airsp...
176,2023-12-15T12:00:00Z,Moldova receives additional funding to improve...,The European Investment Bank has today signed ...,News article,https://ec.europa.eu/commission/presscorner/de...
177,2023-12-14T12:00:00Z,Single European Sky: Performance Review Body p...,The Performance Review Body (PRB) of the Singl...,News article,/news-events/news/single-european-sky-performa...
178,2023-12-14T12:00:00Z,New website of the EU Urban Mobility Observatory,New website of the EU Urban Mobility Observatory,News article,https://urban-mobility-observatory.transport.e...


In [38]:
news.isnull().sum()

date     0
title    0
desc     0
type     0
link     0
dtype: int64

Nothing is missing. However, concatenating the datasets may have introduced duplicate rows if the two files overlap.
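
As a quick illustration of how duplicate detection behaves, the sketch below uses a toy table (invented values). By default a row counts as a duplicate only if every column matches an earlier row; the `subset` and `keep` parameters change what is compared and which copy survives.

```python
import pandas as pd

df = pd.DataFrame({
    "title": ["x", "y", "x", "x"],
    "link": ["/x", "/y", "/x", "/x-moved"],
})

# Full-row comparison: only row 2 repeats an earlier row exactly.
print(df.duplicated().tolist())                # [False, False, True, False]

# Compare titles only: rows 2 and 3 both repeat the title of row 0.
print(df.duplicated(subset="title").tolist())  # [False, False, True, True]

# Keep the last occurrence of each title instead of the first.
latest = df.drop_duplicates(subset="title", keep="last")
print(latest)
```

For the news data, comparing full rows is the cautious choice, since two different articles could in principle share a title.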

### Cleaning

In [39]:
news[news.duplicated()] # identify duplicated rows

Unnamed: 0,date,title,desc,type,link
170,2024-01-22T12:00:00Z,Save the date: Global TBO Symposium (4-6 June ...,Towards global trajectory-based operations (TBO),News article,https://transport.ec.europa.eu/news-events/eve...
171,2024-01-19T12:00:00Z,European Commission workshop addresses key cha...,"Decarbonisation, digitalisation, social securi...",News article,/news-events/news/european-commission-workshop...
172,2024-01-16T12:00:00Z,“Europe for Aviation” teams up for Airspace W...,"The “Europe for Aviation” partners, consisting...",News article,/news-events/news/europe-aviation-teams-airspa...
173,2023-12-21T12:00:00Z,December Infringements package: key decisions,December Infringements package: key decisions,News article,https://ec.europa.eu/commission/presscorner/de...
174,2023-12-19T12:00:00Z,Provisional agreement on more sustainable and ...,TEN-T Trilogue,News article,/news-events/news/provisional-agreement-more-s...
175,2023-12-18T12:00:00Z,Joint Statement on Higher Airspace Operations ...,Joint Statement on Higher Airspace Operations ...,News article,/news-events/news/joint-statement-higher-airsp...
176,2023-12-15T12:00:00Z,Moldova receives additional funding to improve...,The European Investment Bank has today signed ...,News article,https://ec.europa.eu/commission/presscorner/de...
177,2023-12-14T12:00:00Z,Single European Sky: Performance Review Body p...,The Performance Review Body (PRB) of the Singl...,News article,/news-events/news/single-european-sky-performa...
178,2023-12-14T12:00:00Z,New website of the EU Urban Mobility Observatory,New website of the EU Urban Mobility Observatory,News article,https://urban-mobility-observatory.transport.e...
179,2023-12-13T12:00:00Z,Call for applications for the selection of mem...,The Commission is calling for applications wit...,News article,/news-events/news/call-applications-selection-...


In [40]:
news.duplicated() # returns a Boolean Series indicating whether each row is a duplicate of a previous row

0      False
1      False
2      False
3      False
4      False
       ...  
175     True
176     True
177     True
178     True
179     True
Length: 180, dtype: bool

In [41]:
news.duplicated().tail(11)

169    False
170     True
171     True
172     True
173     True
174     True
175     True
176     True
177     True
178     True
179     True
dtype: bool

In [42]:
news = news.drop_duplicates() # drop duplicated rows

In [43]:
news

Unnamed: 0,date,title,desc,type,link
0,2024-01-22T12:00:00Z,Save the date: Global TBO Symposium (4-6 June ...,Towards global trajectory-based operations (TBO),News article,https://transport.ec.europa.eu/news-events/eve...
1,2024-01-19T12:00:00Z,European Commission workshop addresses key cha...,"Decarbonisation, digitalisation, social securi...",News article,/news-events/news/european-commission-workshop...
2,2024-01-16T12:00:00Z,“Europe for Aviation” teams up for Airspace W...,"The “Europe for Aviation” partners, consisting...",News article,/news-events/news/europe-aviation-teams-airspa...
3,2023-12-21T12:00:00Z,December Infringements package: key decisions,December Infringements package: key decisions,News article,https://ec.europa.eu/commission/presscorner/de...
4,2023-12-19T12:00:00Z,Provisional agreement on more sustainable and ...,TEN-T Trilogue,News article,/news-events/news/provisional-agreement-more-s...
...,...,...,...,...,...
165,2024-01-29T12:00:00Z,Commission proposes to modernise river informa...,The Commission adopted a proposal to improve t...,News article,/news-events/news/commission-proposes-modernis...
166,2024-01-25T12:00:00Z,New report comparing air traffic management pe...,In the ongoing series of comparative reports b...,News article,/news-events/news/new-report-comparing-air-tra...
167,2024-01-25T12:00:00Z,STF: Call for applications for new sub-groups ...,Sustainable Transport Forum – new calls for ap...,News article,/news-events/news/stf-call-applications-new-su...
168,2024-01-24T12:00:00Z,Commission supports military mobility projects...,The Commission will fund 38 additional militar...,News article,/news-events/news/commission-supports-military...


We also want to convert the `date` column from strings to datetime values using the [`pandas.to_datetime`](https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html) function.
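
A minimal sketch of what the conversion does, using two timestamps in the same ISO 8601 format as the scraped data plus one deliberately invalid string (the sample values are for illustration only):

```python
import pandas as pd

raw = pd.Series(["2024-01-22T12:00:00Z", "2023-12-21T12:00:00Z"])

# The trailing "Z" marks UTC; utc=True returns timezone-aware datetimes.
dates = pd.to_datetime(raw, utc=True)
print(dates.dt.year.tolist())   # [2024, 2023]

# errors="coerce" turns unparseable strings into NaT instead of raising an error.
messy = pd.Series(["2024-01-22T12:00:00Z", "not a date"])
parsed = pd.to_datetime(messy, utc=True, errors="coerce")
print(parsed.isna().tolist())   # [False, True]
```

Once the values are real datetimes, the `.dt` accessor gives access to parts such as the year or month, and sorting orders the rows chronologically rather than alphabetically.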

In [44]:
type(news.date[0])

str

In [45]:
news.loc[:, "date"] = pd.to_datetime(news.date) # parse the date strings into pandas Timestamps

In [46]:
type(news.date[0])

pandas._libs.tslibs.timestamps.Timestamp

In [47]:
news

Unnamed: 0,date,title,desc,type,link
0,2024-01-22 12:00:00+00:00,Save the date: Global TBO Symposium (4-6 June ...,Towards global trajectory-based operations (TBO),News article,https://transport.ec.europa.eu/news-events/eve...
1,2024-01-19 12:00:00+00:00,European Commission workshop addresses key cha...,"Decarbonisation, digitalisation, social securi...",News article,/news-events/news/european-commission-workshop...
2,2024-01-16 12:00:00+00:00,“Europe for Aviation” teams up for Airspace W...,"The “Europe for Aviation” partners, consisting...",News article,/news-events/news/europe-aviation-teams-airspa...
3,2023-12-21 12:00:00+00:00,December Infringements package: key decisions,December Infringements package: key decisions,News article,https://ec.europa.eu/commission/presscorner/de...
4,2023-12-19 12:00:00+00:00,Provisional agreement on more sustainable and ...,TEN-T Trilogue,News article,/news-events/news/provisional-agreement-more-s...
...,...,...,...,...,...
165,2024-01-29 12:00:00+00:00,Commission proposes to modernise river informa...,The Commission adopted a proposal to improve t...,News article,/news-events/news/commission-proposes-modernis...
166,2024-01-25 12:00:00+00:00,New report comparing air traffic management pe...,In the ongoing series of comparative reports b...,News article,/news-events/news/new-report-comparing-air-tra...
167,2024-01-25 12:00:00+00:00,STF: Call for applications for new sub-groups ...,Sustainable Transport Forum – new calls for ap...,News article,/news-events/news/stf-call-applications-new-su...
168,2024-01-24 12:00:00+00:00,Commission supports military mobility projects...,The Commission will fund 38 additional militar...,News article,/news-events/news/commission-supports-military...


In [48]:
news = news.sort_values(by="date", ascending=False) # sort the rows by date

In [49]:
news

Unnamed: 0,date,title,desc,type,link
100,2024-10-02 12:00:00+00:00,Single European Sky: annual monitoring highlig...,The Performance Review Body (PRB) has publishe...,News article,/news-events/news/single-european-sky-annual-m...
101,2024-09-25 12:00:00+00:00,Commission seeks feedback on the Flight Emissi...,The Commission has launched a public consultat...,News article,/news-events/news/commission-seeks-feedback-fl...
102,2024-09-25 12:00:00+00:00,Women in Rail 2024 Award to boost female talen...,The Commission together with Europe’s Rail Joi...,News article,/news-events/news/women-rail-2024-award-boost-...
103,2024-09-24 12:00:00+00:00,CEF Transport: €2.5 billion to boost resilienc...,The 2024 CEF for Transport call for proposals ...,News article,/news-events/news/cef-transport-eu25-billion-b...
104,2024-09-23 12:00:00+00:00,Rail market opening: competition leads to lowe...,A study released by the European Commission hi...,News article,/news-events/news/rail-market-opening-competit...
...,...,...,...,...,...
95,2022-12-19 12:00:00+00:00,Future Mobility: €40 million EIB loan for Cabi...,The EIB is providing the Spanish multi-mobilit...,News article,/news-events/news/future-mobility-eu40-million...
96,2022-12-16 12:00:00+00:00,New shipping fuel standards to reduce sulphur ...,The Commission welcomes the agreement reached ...,News article,/news-events/news/new-shipping-fuel-standards-...
97,2022-12-15 12:00:00+00:00,Vacancy for one post of Seconded National Expe...,The deadline for applications is 31/01/2023 at...,News article,https://rail-research.europa.eu/about-europes-...
98,2022-12-09 12:00:00+00:00,EU-Ukraine Solidarity Lanes: Commission and EB...,The European Commission and the European Bank ...,News article,/news-events/news/eu-ukraine-solidarity-lanes-...


In [50]:
news.head(10)

Unnamed: 0,date,title,desc,type,link
100,2024-10-02 12:00:00+00:00,Single European Sky: annual monitoring highlig...,The Performance Review Body (PRB) has publishe...,News article,/news-events/news/single-european-sky-annual-m...
101,2024-09-25 12:00:00+00:00,Commission seeks feedback on the Flight Emissi...,The Commission has launched a public consultat...,News article,/news-events/news/commission-seeks-feedback-fl...
102,2024-09-25 12:00:00+00:00,Women in Rail 2024 Award to boost female talen...,The Commission together with Europe’s Rail Joi...,News article,/news-events/news/women-rail-2024-award-boost-...
103,2024-09-24 12:00:00+00:00,CEF Transport: €2.5 billion to boost resilienc...,The 2024 CEF for Transport call for proposals ...,News article,/news-events/news/cef-transport-eu25-billion-b...
104,2024-09-23 12:00:00+00:00,Rail market opening: competition leads to lowe...,A study released by the European Commission hi...,News article,/news-events/news/rail-market-opening-competit...
105,2024-09-20 12:00:00+00:00,Solidarity Lanes: Latest figures – August 2024,Latest figures on Ukrainian exports and import...,News article,/news-events/news/solidarity-lanes-latest-figu...
106,2024-09-16 12:00:00+00:00,European Mobility Week kicks off to promote sh...,"The European Mobility Week, an annual event pr...",News article,/news-events/news/european-mobility-week-kicks...
107,2024-09-09 12:00:00+00:00,TEN-T Corridor Coordinators,Nine European Coordinators have been designate...,News article,/news-events/news/ten-t-corridor-coordinators-...
108,2024-08-29 12:00:00+00:00,Sustainable Transport Forum - new sub-groups o...,In the wake of the adoption of the new Alterna...,News article,/news-events/news/sustainable-transport-forum-...
109,2024-08-14 12:00:00+00:00,Solidarity Lanes: Latest figures – July 2024,Latest figures on Ukrainian exports and import...,News article,/news-events/news/solidarity-lanes-latest-figu...


In [51]:
news['date'] = pd.to_datetime(news['date']) # make sure the 'date' column is of datetime type
news = news[news['date'].dt.year!=2022] # remove news from 2022
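
The same filter-and-sort pattern on a toy table, as a sketch (values invented for illustration). The comparison on `.dt.year` produces a Boolean Series, and indexing with it keeps only the rows where it is `True`.

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(
        ["2022-12-09T12:00:00Z", "2024-10-02T12:00:00Z", "2023-01-26T12:00:00Z"],
        utc=True,
    ),
    "title": ["old", "newest", "middle"],
})

# Keep rows from 2023 onwards, then sort from newest to oldest.
mask = df["date"].dt.year >= 2023
recent = df[mask].sort_values("date", ascending=False)

print(recent["title"].tolist())  # ['newest', 'middle']
```

The filtered rows keep their original index labels (1 and 2 here), which is why the index looks out of order after sorting and filtering until it is reset.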

In [52]:
news

Unnamed: 0,date,title,desc,type,link
100,2024-10-02 12:00:00+00:00,Single European Sky: annual monitoring highlig...,The Performance Review Body (PRB) has publishe...,News article,/news-events/news/single-european-sky-annual-m...
101,2024-09-25 12:00:00+00:00,Commission seeks feedback on the Flight Emissi...,The Commission has launched a public consultat...,News article,/news-events/news/commission-seeks-feedback-fl...
102,2024-09-25 12:00:00+00:00,Women in Rail 2024 Award to boost female talen...,The Commission together with Europe’s Rail Joi...,News article,/news-events/news/women-rail-2024-award-boost-...
103,2024-09-24 12:00:00+00:00,CEF Transport: €2.5 billion to boost resilienc...,The 2024 CEF for Transport call for proposals ...,News article,/news-events/news/cef-transport-eu25-billion-b...
104,2024-09-23 12:00:00+00:00,Rail market opening: competition leads to lowe...,A study released by the European Commission hi...,News article,/news-events/news/rail-market-opening-competit...
...,...,...,...,...,...
86,2023-01-31 12:00:00+00:00,Stakeholder conference on mitigating the socia...,The Commission plans to issue a Recommendation...,News article,/news-events/news/stakeholder-conference-mitig...
88,2023-01-30 12:00:00+00:00,CINEA launches a new public dashboard covering...,"The European Climate, Infrastructure and Envir...",News article,https://cinea.ec.europa.eu/news-events/news/ci...
89,2023-01-26 12:00:00+00:00,January Infringements package: key decisions,January Infringements package: key decisions,News article,https://ec.europa.eu/commission/presscorner/de...
90,2023-01-26 12:00:00+00:00,New EU rules on dedicated airspace for drones ...,"As of today, EU rules establishing a dedicated...",News article,/news-events/news/new-eu-rules-dedicated-airsp...


In [53]:
news.type.unique()

array(['News article', 'Press release', 'Supplementary information'],
      dtype=object)

In [54]:
news[news.type=="Press release"]

Unnamed: 0,date,title,desc,type,link
114,2024-07-22 12:00:00+00:00,Commission publishes new guidelines for more c...,While more citizens of the European Union are ...,Press release,/news-events/news/commission-publishes-new-gui...


In [55]:
news[news.type=="Supplementary information"]

Unnamed: 0,date,title,desc,type,link
23,2023-10-19 12:00:00+00:00,Q&A – Assistance to passengers for flight disr...,Q&A,Supplementary information,/news-events/news/qa-assistance-passengers-fli...


Let's keep only the news articles.

In [56]:
news = news[news.type=="News article"]

In [57]:
news = news.reset_index(drop=True) # reset the index
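
Before dropping categories it can help to see how many rows each one has. The sketch below, on a toy table with invented titles, counts the categories with `value_counts()`, filters on one of them, and resets the index.

```python
import pandas as pd

df = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "type": ["News article", "Press release", "News article", "Supplementary information"],
})

# Number of rows per category, most frequent first.
counts = df["type"].value_counts()
print(counts)

# Keep one category; drop=True discards the old labels instead of adding them as a column.
only_news = df[df["type"] == "News article"].reset_index(drop=True)
print(only_news)
```

Without `drop=True`, `reset_index()` would keep the old labels in a new column named `index`.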

In [58]:
news

Unnamed: 0,date,title,desc,type,link
0,2024-10-02 12:00:00+00:00,Single European Sky: annual monitoring highlig...,The Performance Review Body (PRB) has publishe...,News article,/news-events/news/single-european-sky-annual-m...
1,2024-09-25 12:00:00+00:00,Commission seeks feedback on the Flight Emissi...,The Commission has launched a public consultat...,News article,/news-events/news/commission-seeks-feedback-fl...
2,2024-09-25 12:00:00+00:00,Women in Rail 2024 Award to boost female talen...,The Commission together with Europe’s Rail Joi...,News article,/news-events/news/women-rail-2024-award-boost-...
3,2024-09-24 12:00:00+00:00,CEF Transport: €2.5 billion to boost resilienc...,The 2024 CEF for Transport call for proposals ...,News article,/news-events/news/cef-transport-eu25-billion-b...
4,2024-09-23 12:00:00+00:00,Rail market opening: competition leads to lowe...,A study released by the European Commission hi...,News article,/news-events/news/rail-market-opening-competit...
...,...,...,...,...,...
155,2023-01-31 12:00:00+00:00,Stakeholder conference on mitigating the socia...,The Commission plans to issue a Recommendation...,News article,/news-events/news/stakeholder-conference-mitig...
156,2023-01-30 12:00:00+00:00,CINEA launches a new public dashboard covering...,"The European Climate, Infrastructure and Envir...",News article,https://cinea.ec.europa.eu/news-events/news/ci...
157,2023-01-26 12:00:00+00:00,January Infringements package: key decisions,January Infringements package: key decisions,News article,https://ec.europa.eu/commission/presscorner/de...
158,2023-01-26 12:00:00+00:00,New EU rules on dedicated airspace for drones ...,"As of today, EU rules establishing a dedicated...",News article,/news-events/news/new-eu-rules-dedicated-airsp...


In [59]:
news = news[["date", "title", "desc"]] # new df with selected columns

In [60]:
news

Unnamed: 0,date,title,desc
0,2024-10-02 12:00:00+00:00,Single European Sky: annual monitoring highlig...,The Performance Review Body (PRB) has publishe...
1,2024-09-25 12:00:00+00:00,Commission seeks feedback on the Flight Emissi...,The Commission has launched a public consultat...
2,2024-09-25 12:00:00+00:00,Women in Rail 2024 Award to boost female talen...,The Commission together with Europe’s Rail Joi...
3,2024-09-24 12:00:00+00:00,CEF Transport: €2.5 billion to boost resilienc...,The 2024 CEF for Transport call for proposals ...
4,2024-09-23 12:00:00+00:00,Rail market opening: competition leads to lowe...,A study released by the European Commission hi...
...,...,...,...
155,2023-01-31 12:00:00+00:00,Stakeholder conference on mitigating the socia...,The Commission plans to issue a Recommendation...
156,2023-01-30 12:00:00+00:00,CINEA launches a new public dashboard covering...,"The European Climate, Infrastructure and Envir..."
157,2023-01-26 12:00:00+00:00,January Infringements package: key decisions,January Infringements package: key decisions
158,2023-01-26 12:00:00+00:00,New EU rules on dedicated airspace for drones ...,"As of today, EU rules establishing a dedicated..."


In [61]:
news.to_csv("news_clean.csv", index=False) # save to csv
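
One thing to keep in mind: a CSV file stores everything as text, so the datetime type is lost on saving. The sketch below writes a toy table to an in-memory buffer (standing in for a file on disk) and reads it back with `parse_dates` so that the `date` column is parsed again.

```python
import io

import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2024-10-02T12:00:00Z", "2023-01-26T12:00:00Z"], utc=True),
    "title": ["first", "second"],
})

# Write to an in-memory text buffer instead of a real file.
buffer = io.StringIO()
df.to_csv(buffer, index=False)  # index=False leaves the row labels out of the file
buffer.seek(0)

# parse_dates asks read_csv to convert the listed columns back to datetimes.
restored = pd.read_csv(buffer, parse_dates=["date"])
print(restored.dtypes)
```

With a real file, the equivalent call would be `pd.read_csv("news_clean.csv", parse_dates=["date"])`.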

&#x2753;**<font color=cornflowerblue>QUESTION: </font>** What are the common tasks involved in data wrangling?

&#x270A; **<font color=firebrick>DO THIS: </font>** Here are two options to practice what we've learned and to further develop your data skills:

1. Do some data wrangling on the data you scraped last week.
2. Explore the existing `news_clean.csv` dataset. What research questions could it answer, and which methods or tools would you use?

Feel free to pick one task or do both. Have fun!

In [None]:
# put your code here

---------
### Congratulations, we are done!

This notebook is written by [Meng Cai](https://www.verkehr.tu-darmstadt.de/vv/das_institut_ivv/team_ivv/wissenschaftliche_mitarbeiter_doktoranden/meng_cai/standardseite_204.de.jsp), Technical University of Darmstadt. This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc/4.0/">Creative Commons Attribution-NonCommercial 4.0 International License</a>.
<a rel="license" href="http://creativecommons.org/licenses/by-nc/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc/4.0/88x31.png" /></a>