<font style='font-size:1.5em'>**👨🏻‍🏫 Week 04 lecture – Web Scraping** </font>

<font style='font-size:1.2em'>DS105W – Data for Data Science</font>

**AUTHORS:**  Dr. [Jon Cardoso-Silva](https://jonjoncardoso.github.io)

**DEPARTMENT:** [LSE Data Science Institute](https://lse.ac.uk/dsi)

**OBJECTIVE**: Learn how to collect data from the Web using Python packages

**LAST REVISION:** 8 February 2024




--- 

# Part I: ⚙️ The setup

You will need to install the `requests` and Scrapy packages to complete this lab. I will assume you have already configured the virtual environment for this course.



Open the terminal (doing it directly from within VS Code will be easier) and run the following command:


```bash
pip install pandas requests scrapy
```


In [1]:
import requests               # This is how we access the web
import pandas as pd           # This is how we work with data frames

from pprint import pprint     # Print things in a pretty way
from scrapy import Selector   # This is how we parse HTML


# Part II: Requesting a web page


You might have heard of [CIVICA](https://www.civica.eu/who-we-are/about-civica/) before. It is a body that unites several European universities to collaborate in the areas of social sciences, humanities, business and public policy. CIVICA hosts a [Data Science Seminar series](https://socialdatascience.network/index.html#schedule) that might be of interest to you. Today we will collect information on some of the seminars. Maybe you can use it in the future! 

**Our main task** is to create a 🐼 pandas data frame that contains:

1. names of the seminars 
2. names of speakers of those seminars
3. dates of the seminars
4. bios of the speakers from each individual event 


## 2.1. Request a website


In [41]:
# This is the address of the website we want to scrape
my_url = 'https://socialdatascience.network/index.html#schedule'

# We send a GET request to the website
response = requests.get(my_url)

# What is the response code?
response

<Response [200]>

**📜 Other possible responses**

The response code is a standard way of communicating the status of a request. There are many other possible responses:

- **200** OK
- **204** No Content
- **400** Bad Request
- **401** Unauthorized
- **402** Payment Required
- **403** Forbidden
- **404** Not Found
- **500** Internal Server Error
- **502** Bad Gateway

🗣️ **CLASSROOM DISCUSSION:** Have you ever encountered any of these responses when browsing the Web on your browser? Where? What did you do about it?


You can find a full list [here](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status).
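
As a small illustration of the codes listed above, Python's standard library already knows the official phrase for each status code. The sketch below uses `http.HTTPStatus`; the helper name `describe_status` and the grouping into "families" are made up for this example and are not part of `requests`.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short human-readable description of an HTTP status code."""
    status = HTTPStatus(code)  # raises ValueError for codes that do not exist
    if 200 <= code < 300:
        family = "success"
    elif 400 <= code < 500:
        family = "client error"
    elif 500 <= code < 600:
        family = "server error"
    else:
        family = "other"
    return f"{code} {status.phrase} ({family})"


# Describe every code from the list above
for code in [200, 204, 400, 401, 402, 403, 404, 500, 502]:
    print(describe_status(code))
```

In the notebook you could call `describe_status(response.status_code)` to get a readable label for the response you received.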

## 2.2. A closer look at the response

What else is stored in the `response` object?

In [4]:
# The vars function returns all attributes of an object, along with their values
# You will see that it is essentially just a dictionary
vars(response)

 '_content_consumed': True,
 '_next': None,
 'status_code': 200,
 'headers': {'Connection': 'keep-alive', 'Content-Length': '18651', 'Server': 'GitHub.com', 'Content-Type': 'text/html; charset=utf-8', 'Last-Modified': 'Tue, 06 Feb 2024 10:58:31 GMT', 'Access-Control-Allow-Origin': '*', 'ETag': 'W/"65c210d7-1c1c6"', 'expires': 'Thu, 08 Feb 2024 15:18:24 GMT', 'Cache-Control': 'max-age=600', 'Content-Encoding': 'gzip', 'x-proxy-cache': 'MISS', 'X-GitHub-Request-Id': 'A5B0:FE42:23F7EF4:24BA65A:65C4EE68', 'Accept-Ranges': 'bytes', 'Date': 'Thu, 08 Feb 2024 17:36:59 GMT', 'Via': '1.1 varnish', 'Age': '48', 'X-Served-By': 'cache-lcy-eglc8600065-LCY', 'X-Cache': 'HIT', 'X-Cache-Hits': '1', 'X-Timer': 'S1707413819.335317,VS0,VE1', 'Vary': 'Accept-Encoding', 'X-Fastly-Request-ID': '97ec27249c9ebced1297496006308adcd2b836d5'},
 'raw': <urllib3.response.HTTPResponse at 0x18ea64c2a40>,
 'url': 'https://socialdatascience.network/index.html#schedule',
 'encoding': 'utf-8',
 'history': [],
 'reason': 

🗣️ **CLASSROOM DISCUSSION:**

You have already looked at `response.status_code`. But what do you think the following attributes of the `response` object are?

- `response.headers`
- `response.cookies`
- `response.content`

Feel free to open a new chunk of code below and explore these attributes.

The `encoding` we saw in the output above is not the only **metadata** we can get from the response. Let's take a look at all the headers:

In [42]:
# Headers are metadata about the response
print(response.headers)

{'Connection': 'keep-alive', 'Content-Length': '18651', 'Server': 'GitHub.com', 'Content-Type': 'text/html; charset=utf-8', 'Last-Modified': 'Tue, 06 Feb 2024 10:58:31 GMT', 'Access-Control-Allow-Origin': '*', 'ETag': 'W/"65c210d7-1c1c6"', 'expires': 'Sat, 10 Feb 2024 18:37:07 GMT', 'Cache-Control': 'max-age=600', 'Content-Encoding': 'gzip', 'x-proxy-cache': 'MISS', 'X-GitHub-Request-Id': '78A4:4AF44:492A5DE:4AC7DA0:65C7BFFB', 'Accept-Ranges': 'bytes', 'Date': 'Sat, 10 Feb 2024 18:27:08 GMT', 'Via': '1.1 varnish', 'Age': '0', 'X-Served-By': 'cache-lhr7331-LHR', 'X-Cache': 'MISS', 'X-Cache-Hits': '0', 'X-Timer': 'S1707589628.893803,VS0,VE126', 'Vary': 'Accept-Encoding', 'X-Fastly-Request-ID': '3fea3040d5bff08ef65e47b842101a93defded27'}


We could choose to manipulate the headers above as a `pd.Series`:

In [6]:
pd.Series(response.headers)

Connection                                                   keep-alive
Content-Length                                                    18651
Server                                                       GitHub.com
Content-Type                                   text/html; charset=utf-8
Last-Modified                             Tue, 06 Feb 2024 10:58:31 GMT
Access-Control-Allow-Origin                                           *
ETag                                                 W/"65c210d7-1c1c6"
expires                                   Thu, 08 Feb 2024 15:18:24 GMT
Cache-Control                                               max-age=600
Content-Encoding                                                   gzip
x-proxy-cache                                                      MISS
X-GitHub-Request-Id                  A5B0:FE42:23F7EF4:24BA65A:65C4EE68
Accept-Ranges                                                     bytes
Date                                      Thu, 08 Feb 2024 17:36
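
One detail worth knowing: in `requests`, `response.headers` is not a plain `dict` but a `CaseInsensitiveDict`, so header names can be looked up in any capitalisation. A minimal sketch, using a hand-copied subset of the headers printed above instead of a live request (the variable names are illustrative):

```python
from requests.structures import CaseInsensitiveDict

# A small subset of the headers shown above, copied by hand for this demo
demo_headers = CaseInsensitiveDict({
    "Content-Type": "text/html; charset=utf-8",
    "Content-Length": "18651",
    "Server": "GitHub.com",
})

# Lookups ignore the capitalisation of the header name
content_type = demo_headers["content-type"]

# Header values are always strings, so convert them when you need a number
size_in_bytes = int(demo_headers["CONTENT-LENGTH"])

print(content_type)
print(size_in_bytes)
```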

Let me show you the HTML stored in the `response` object by printing its `text` attribute.

In [43]:
print(response.text)

<!DOCTYPE html>
<html lang="en">

<head>
  <meta charset="utf-8">
  <title>CIVICA Data Science Seminar</title>
  <meta content="width=device-width, initial-scale=1.0" name="viewport">
  <meta content="CIVICA Data Science Seminar" name="keywords">
  <meta content="A series of data science workshops and seminars" name="description">

  <!-- Favicons -->
  <link href="img/c-favicon.png" rel="icon">
  <link href="img/apple-touch-icon.png" rel="apple-touch-icon">

  <!-- Google Fonts -->
  <link href="https://fonts.googleapis.com/css?family=Open+Sans:300,300i,400,400i,700,700i|Raleway:300,400,500,700,800" rel="stylesheet">

  <!-- Bootstrap CSS File -->
  <link href="lib/bootstrap/css/bootstrap.min.css" rel="stylesheet">

  <!-- Libraries CSS Files -->
  <link href="lib/font-awesome/css/font-awesome.min.css" rel="stylesheet">
  <link href="lib/animate/animate.min.css" rel="stylesheet">
  <link href="lib/venobox/venobox.css" rel="stylesheet">
  <link href="lib/owlcarousel/assets/owl.carousel

The code chunk above makes sense here because I want to show you how to inspect objects when in **prototype mode**. However, whenever you are writing in a Jupyter Notebook to report to someone (say, when submitting your assignment), you should remove code chunks that produce a lot of unnecessary output.


💡 **A DETAIL THAT SEEMS INSIGNIFICANT BUT THAT IS EXTREMELY IMPORTANT**: 
- If you are on Mac or Linux, you will find that the line-break character is `\n`. 
- If you are on Windows, you will find that the line break is `\r\n`. 
- Windows uses two characters to break lines, while Mac and Linux use only one. 
- This is a common source of errors when working with text files across two different operating systems. (For example: you use Mac and collaborate with someone who uses Windows.)
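
The difference is easy to see with two short made-up strings. This is only a sketch with invented text, not the actual page content:

```python
# The same two lines of text, written with Windows and with Mac/Linux line breaks
windows_text = "first line\r\nsecond line\r\n"
unix_text = "first line\nsecond line\n"

# The Windows version is longer: one extra character per line break
print(len(windows_text), len(unix_text))

# str.splitlines() understands both conventions and returns the same lines
windows_lines = windows_text.splitlines()
unix_lines = unix_text.splitlines()

# Replacing \r\n with \n converts the Windows text to the Mac/Linux convention
normalised = windows_text.replace("\r\n", "\n")
```

This is also why the character count of `response.text` can differ from what you would expect after counting visible characters only.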

How many characters are there in the `response.text`?

In [8]:
len(response.text)

115124

Not very useful to treat it as a pure string, right? We need to find a better way to parse this data.


## 2.3. Parsing HTML

Scrapy's `Selector` class extracts data from HTML and XML documents using CSS or XPath selectors, which makes it a powerful tool for web scraping. It is a core part of the Scrapy framework but can also be used on its own, as we do here.

When you feed HTML text to the Scrapy Selector, it parses the HTML and stores it in a **Selector object**. This object lets you access parts of the HTML by calling methods with Python's usual dot notation and passing them CSS syntax. If, for instance, the object is called `sel` and you want to fetch the title of the page, you might use `sel.css('title')`.


In [45]:
# parse the HTML code using Scrapy Selector
sel = Selector(text=response.text)
print(sel)

<Selector xpath=None data='<html lang="en">\r\n\r\n<head>\r\n  <meta c...'>


💡 Note: I was only able to call `Selector()` directly because I had already imported it at the top of the notebook. Scroll up to see it. If I hadn't, the code above would have thrown an error.

In [56]:
sel.css('title')


[<Selector xpath='descendant-or-self::title' data='<title>CIVICA Data Science Seminar</t...'>]

In [60]:
# Get the text of the first <title> tag
title_text = sel.css('title::text').get()
print(title_text)


CIVICA Data Science Seminar

**Check `sel.get()` to see the full HTML document**

This returns essentially the same HTML as `response.text`, but after parsing (note, for example, that the `<!DOCTYPE html>` line is gone).

In [13]:
print(sel.get())

<html lang="en">

<head>
  <meta charset="utf-8">
  <title>CIVICA Data Science Seminar</title>
  <meta content="width=device-width, initial-scale=1.0" name="viewport">
  <meta content="CIVICA Data Science Seminar" name="keywords">
  <meta content="A series of data science workshops and seminars" name="description">

  <!-- Favicons -->
  <link href="img/c-favicon.png" rel="icon">
  <link href="img/apple-touch-icon.png" rel="apple-touch-icon">

  <!-- Google Fonts -->
  <link href="https://fonts.googleapis.com/css?family=Open+Sans:300,300i,400,400i,700,700i%7CRaleway:300,400,500,700,800" rel="stylesheet">

  <!-- Bootstrap CSS File -->
  <link href="lib/bootstrap/css/bootstrap.min.css" rel="stylesheet">

  <!-- Libraries CSS Files -->
  <link href="lib/font-awesome/css/font-awesome.min.css" rel="stylesheet">
  <link href="lib/animate/animate.min.css" rel="stylesheet">
  <link href="lib/venobox/venobox.css" rel="stylesheet">
  <link href="lib/owlcarousel/assets/owl.carousel.min.css" rel=

**HTML documents usually have a \<header\> tag:**

(⚠️ not to be confused with the HTTP header we saw with `response.headers`)

In [14]:
sel.css('header')

[<Selector xpath='descendant-or-self::header' data='<header id="header">\r\n    <div class=...'>]

In [16]:
sel.css('div')

[<Selector xpath='descendant-or-self::div' data='<div class="container">\r\n\r\n      <div...'>,
 <Selector xpath='descendant-or-self::div' data='<div id="logo" class="pull-left">\r\n  ...'>,
 <Selector xpath='descendant-or-self::div' data='<div class="intro-container wow fadeI...'>,
 <Selector xpath='descendant-or-self::div' data='<div style="width: 500px">\r\n         ...'>,
 <Selector xpath='descendant-or-self::div' data='<div id="first_assistant_desc" style=...'>,
 <Selector xpath='descendant-or-self::div' data='<div class="lds-ellipsis" id="loading...'>,
 <Selector xpath='descendant-or-self::div' data='<div></div>'>,
 <Selector xpath='descendant-or-self::div' data='<div></div>'>,
 <Selector xpath='descendant-or-self::div' data='<div></div>'>,
 <Selector xpath='descendant-or-self::div' data='<div></div>'>,
 <Selector xpath='descendant-or-self::div' data='<div id="output"></div>'>,
 <Selector xpath='descendant-or-self::div' data='<div class="text-center"><button type...'>,
 <Selecto

There is also usually a `<body>` tag, which contains the main content of the page:

In [15]:
sel.css('body')



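Each seminar on the page sits inside a `<div>` whose `class` attribute includes `card`. In the selector `div.card` below, the `.card` part restricts the match to those `<div>` elements.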

In [18]:
cards = sel.css('div.card')
len(cards)

39

In [21]:
print(cards[0].extract())

<div class="card mb-4">

    <!--Card image-->
    <div class="view overlay">
      <a href="spring2024/sess1.html"><img class="card-img-top" src="spring2024/visuals/sess1/sess1.png" alt="data science"></a>
      <a href="spring2024/sess1.html">
        <div class="mask rgba-white-slight"></div>
      </a>
    </div>

    <!--Card content-->
    <div class="card-body">

      <!--Title-->
      <a href="spring2024/sess1.html"><h6 class="card-title">Misinformation exposure beyond traditional feeds: Evidence from a WhatsApp deactivation experiment in Brazil</h6></a>
      <!--Text-->
      <p class="card-text">Speaker: Prof. Tiago Ventura, Georgetown University <br> Date: Wednesday, 07 February 2024</p>
      <!-- Provides extra visual weight and identifies the primary action in a set of buttons -->
      <a href="spring2024/sess1.html"><button type="button" class="btn btn-light-blue btn-md">Read more</button></a>

    </div>

  </div>


In [22]:
# The same card as a raw Python string (note the \r\n line breaks)
cards[0].extract()

'<div class="card mb-4">\r\n\r\n    <!--Card image-->\r\n    <div class="view overlay">\r\n      <a href="spring2024/sess1.html"><img class="card-img-top" src="spring2024/visuals/sess1/sess1.png" alt="data science"></a>\r\n      <a href="spring2024/sess1.html">\r\n        <div class="mask rgba-white-slight"></div>\r\n      </a>\r\n    </div>\r\n\r\n    <!--Card content-->\r\n    <div class="card-body">\r\n\r\n      <!--Title-->\r\n      <a href="spring2024/sess1.html"><h6 class="card-title">Misinformation exposure beyond traditional feeds: Evidence from a WhatsApp deactivation experiment in Brazil</h6></a>\r\n      <!--Text-->\r\n      <p class="card-text">Speaker: Prof. Tiago Ventura, Georgetown University <br> Date: Wednesday, 07 February 2024</p>\r\n      <!-- Provides extra visual weight and identifies the primary action in a set of buttons -->\r\n      <a href="spring2024/sess1.html"><button type="button" class="btn btn-light-blue btn-md">Read more</button></a>\r\n\r\n    </div>\r

In [23]:
# Select the <h6> heading inside the first card; .extract() returns a list of matches
cards[0].css("h6").extract()

['<h6 class="card-title">Misinformation exposure beyond traditional feeds: Evidence from a WhatsApp deactivation experiment in Brazil</h6>']

In [34]:
# Add ::text to extract only the text inside the <h6> heading of the first card
cards[0].css("h6 ::text").extract()

['Misinformation exposure beyond traditional feeds: Evidence from a WhatsApp deactivation experiment in Brazil']

In [36]:
def get_title(card):
    return card.css("h6 ::text").extract_first()

In [62]:
all_titles = []
for i in range(len(cards)):
    all_titles.append(get_title(cards[i]))

all_titles

['Misinformation exposure beyond traditional feeds: Evidence from a WhatsApp deactivation experiment in Brazil',
 'Promoting the systematic use of real-world data and real-world evidence for digital health technologies across Europe: A consensus framework',
 'Data science for the Sustainable Development Goals: the case of food security',
 'CentralBankRoBERTa: A Fine-Tuned Large Language Model for Central Bank Communications',
 'The Evolution of the Climate Discourse on Twitter: Polarization, Hypocrisy, and the Musk Takeover',
 'The Handbook of Computational Social Science for Policy',
 'Artificial Intelligence, Algorithmic Recommendations and Competition',
 'Exploring A New Model of Industry/Academic Collaboration: the U.S. 2020 Facebook and Instagram Election Study',
 'Using Multimodal Neural Networks to Better Understand How Voters Process Audiovisual Information',
 "Models, mathematics, and data science: how to make sure we're answering the right questions",
 'CIVICA Conference on E

In [38]:
all_titles = [get_title(card) for card in cards]
all_titles

['Misinformation exposure beyond traditional feeds: Evidence from a WhatsApp deactivation experiment in Brazil',
 'Promoting the systematic use of real-world data and real-world evidence for digital health technologies across Europe: A consensus framework',
 'Data science for the Sustainable Development Goals: the case of food security',
 'CentralBankRoBERTa: A Fine-Tuned Large Language Model for Central Bank Communications',
 'The Evolution of the Climate Discourse on Twitter: Polarization, Hypocrisy, and the Musk Takeover',
 'The Handbook of Computational Social Science for Policy',
 'Artificial Intelligence, Algorithmic Recommendations and Competition',
 'Exploring A New Model of Industry/Academic Collaboration: the U.S. 2020 Facebook and Instagram Election Study',
 'Using Multimodal Neural Networks to Better Understand How Voters Process Audiovisual Information',
 "Models, mathematics, and data science: how to make sure we're answering the right questions",
 'CIVICA Conference on E

In [25]:
# <a> is the anchor tag, used for links; .extract() returns a list, so [0] takes the first match
cards[1].css("a").extract()[0]

'<a href="spring2024/sess1.html"><img class="card-img-top" src="spring2024/visuals/sess1/sess1.png" alt="data science"></a>'

In [35]:
# The text nodes inside the <a> tags; the first one turns out to be just whitespace
cards[1].css("a ::text").extract()[0]

'\r\n        '

🔑 **Takeaway from running `sel.css('body')` earlier:**

- The output is a list, as indicated by the square brackets. 
- HTML pages only have one `<body>` tag, so this list contains a single element, which is an object of the class Selector.

What if I want to look at the content of the `<body>` tag?

In [63]:
pprint(sel.css('body').get())

('<body>\r\n'
 '\r\n'
 '    Header\r\n'
 '  <header id="header">\r\n'
 '    <div class="container">\r\n'
 '\r\n'
 '      <div id="logo" class="pull-left">\r\n'
 '        <!-- Uncomment below if you prefer to use a text logo -->\r\n'
 '        <!-- <h1><a href="#main">C<span>o</span>nf</a></h1>-->\r\n'
 '        <a href="#intro" class="scrollto"><img src="img/logo.png" alt="" '
 'title=""></a>\r\n'
 '      </div>\r\n'
 '\r\n'
 '      <nav id="nav-menu-container">\r\n'
 '        <ul class="nav-menu">\r\n'
 '          <li class="menu-active"><a href="#intro">Home</a></li>\r\n'
 '          <li><a href="#about">About</a></li>\r\n'
 '<!--           <li><a href="#speakers">Speakers</a></li>\r\n'
 ' -->          <li><a href="#schedule">Schedule</a></li>\r\n'
 '          <li><a href="#supporters">Partner Institutions</a></li>\r\n'
 '          <li><a href="summerschool.html">Summer School</a></li>\r\n'
 '          <li><a href="#gallery">Gallery</a></li>\r\n'
 '          <li><a href="#contact">Co

**Are there any `<h1>` tags in this page?**

In [64]:
sel.css('h1').get()

'<h1 class="mb-4 pb-0">CIVICA<br><span>Data Science</span> Seminar Series</h1>'

What about `<h2>` tags?

In [65]:
sel.css('h2').getall()

['<h2>Seminar Schedule</h2>',
 '<h2>Partner Institutions</h2>',
 '<h2>Gallery</h2>',
 '<h2>F.A.Q </h2>',
 '<h2>Newsletter</h2>',
 '<h2>Contact Us</h2>']

If you care just about the **first** `<h2>` tag, you can use the `.get()` method instead of `.getall()`:

In [66]:
sel.css("h2").get()

'<h2>Seminar Schedule</h2>'

**How to get the text from a tag:**

In [67]:
sel.css("h2 ::text").get()

'Seminar Schedule'

**How to get the text of tags returned by the `.css()` method?**

You can also apply `::text` to each element returned by the `.css()` method, one at a time.


In [68]:
# Pure Python way: loop over the <h2> selectors and pull the text out of each one
all_h2_tags = sel.css("h2")
all_h2_texts = []

for tag in all_h2_tags:
    all_h2_texts.append(tag.css("::text").get())

all_h2_texts

['Seminar Schedule',
 'Partner Institutions',
 'Gallery',
 'F.A.Q ',
 'Newsletter',
 'Contact Us']

**Consider using a [list comprehension](https://www.w3schools.com/python/python_lists_comprehension.asp) for cleaner code:**

In [69]:
# one-liner way
all_h2_texts = [tag.get() for tag in sel.css("h2 ::text")]
all_h2_texts

['Seminar Schedule',
 'Partner Institutions',
 'Gallery',
 'F.A.Q ',
 'Newsletter',
 'Contact Us']

💡 **IMPORTANT TIPS:**

- Over the next couple of weeks, make it a habit to right-click on a webpage every now and then and select "Inspect" (or "Inspect Element") to explore how the HTML is structured. This will help you understand how to use CSS selectors to extract the data you need.
- Tag names and ` ::text` are just the tip of the iceberg. Read about other CSS selectors [here](https://www.w3schools.com/cssref/css_selectors.asp).
- Bookmark the [Scrapy Selectors documentation page](https://docs.scrapy.org/en/latest/topics/selectors.html) and revisit it whenever you need to. Practice using different CSS selectors to extract data from the HTML.

# Part III: Your turn!

Let's make this a dynamic and interactive lecture.


🎯 **ACTION POINTS**

1. Go to the [Data Science Seminar series](https://socialdatascience.network/index.html#schedule) website, inspect the page (right-click + Inspect) and locate the name of the first event in the HTML. 

3. Write down the "directions" inside the HTML file to reach the event title. For example, maybe you will find that:

    > _The first event title is inside a \<html\> ➡️ \<div\> ➡️ \<div\> ➡️ \<h3\> tag_.

    Write it in the markdown cell below:


> _The first event title is inside a \<html\> ➡️ \<body\> ➡️ \<div\> ➡️\<div\> ➡️\<div\> ➡️\<div\> ➡️\<div\> ➡️\<a\> ➡️ \<h6\> tag_.

4. Now, use the skill that you have just learned to scrape the names of ALL events. Save them all to a list.

In [70]:
# Reuse get_title() on every card to collect all event titles
all_titles = [get_title(card) for card in cards]
all_titles

['Misinformation exposure beyond traditional feeds: Evidence from a WhatsApp deactivation experiment in Brazil',
 'Promoting the systematic use of real-world data and real-world evidence for digital health technologies across Europe: A consensus framework',
 'Data science for the Sustainable Development Goals: the case of food security',
 'CentralBankRoBERTa: A Fine-Tuned Large Language Model for Central Bank Communications',
 'The Evolution of the Climate Discourse on Twitter: Polarization, Hypocrisy, and the Musk Takeover',
 'The Handbook of Computational Social Science for Policy',
 'Artificial Intelligence, Algorithmic Recommendations and Competition',
 'Exploring A New Model of Industry/Academic Collaboration: the U.S. 2020 Facebook and Instagram Election Study',
 'Using Multimodal Neural Networks to Better Understand How Voters Process Audiovisual Information',
 "Models, mathematics, and data science: how to make sure we're answering the right questions",
 'CIVICA Conference on E

5. Do the same with the dates of the events and speaker names and save them to separate lists. 



In [86]:
all_dates = []
for card in cards:
    # The date is the text node that follows the <br> inside the card's text paragraph
    date_text = card.xpath('.//br/following-sibling::text()').get()
    # Strip surrounding whitespace; store None when a card has no date
    all_dates.append(date_text.strip() if date_text else None)

all_dates



['Date: Wednesday, 07 February 2024',
 'Date: Wednesday, 22 November 2023',
 'Date: Wednesday, 18 October 2023',
 'Date: Wednesday, 27 September 2023',
 'Date: Wednesday, 13 September 2023',
 'Date: Wednesday, 31 May 2023',
 'Date: Wednesday, 03 May 2023',
 'Date: Wednesday, 19 April 2023',
 'Date: Wednesday, 22 March 2023',
 'Date: Wednesday, 08 March 2023',
 'Date: Wednesday, 15 February 2023',
 'Date: Wednesday, 08 February 2023',
 'Date: Wednesday, 11 January 2023',
 'Date: Wednesday, 02 November 2022',
 'Date: Wednesday, 19 October 2022',
 'Date: Wednesday, 14 September 2022',
 'Date: Wednesday, 15 June 2022',
 'Date: Wednesday, 01 June 2022',
 'Date: Wednesday, 04 May 2022',
 'Date: Wednesday, 20 April 2022',
 'Date: Wednesday, 09 March 2022',
 'Date: Wednesday, 23 February 2022',
 'Date: Wednesday, 09 February 2022',
 'Date: Wednesday, 26 January 2022',
 'Date: Wednesday, 12 January 2022',
 'Date: Wednesday, 1 December 2021',
 'Date: Wednesday, 3 November 2021',
 'Date: Wednesda

6. Combine all of these lists into a pandas data frame and save it to a CSV file.

In [89]:
import pandas as pd

# Combine all_titles and all_dates into a DataFrame
df = pd.DataFrame({
    'Title': all_titles,
    'Date': all_dates
})

# Display the DataFrame
print(df.head())


                                               Title  \
0  Misinformation exposure beyond traditional fee...   
1  Promoting the systematic use of real-world dat...   
2  Data science for the Sustainable Development G...   
3  CentralBankRoBERTa: A Fine-Tuned Large Languag...   
4  The Evolution of the Climate Discourse on Twit...   

                                 Date  
0   Date: Wednesday, 07 February 2024  
1   Date: Wednesday, 22 November 2023  
2    Date: Wednesday, 18 October 2023  
3  Date: Wednesday, 27 September 2023  
4  Date: Wednesday, 13 September 2023  


In [90]:
# Save the DataFrame to a CSV file
df.to_csv('take_home_tasks_week_4.csv', index=False)


7. Double-check that the CSV file was created correctly by opening it using pandas. Then convert the columns to appropriate data types.

In [92]:
# Read the CSV file back in to check that it was saved correctly
check = pd.read_csv('take_home_tasks_week_4.csv')
check.head()

|   | Title | Date |
|---|-------|------|
| 0 | Misinformation exposure beyond traditional fee... | Date: Wednesday, 07 February 2024 |
| 1 | Promoting the systematic use of real-world dat... | Date: Wednesday, 22 November 2023 |
| 2 | Data science for the Sustainable Development G... | Date: Wednesday, 18 October 2023 |
| 3 | CentralBankRoBERTa: A Fine-Tuned Large Languag... | Date: Wednesday, 27 September 2023 |
| 4 | The Evolution of the Climate Discourse on Twit... | Date: Wednesday, 13 September 2023 |
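
The second half of the task, converting the columns to appropriate data types, could look something like the sketch below. It works on a small hand-made data frame (two rows copied from the outputs above, with shortened placeholder titles), and it assumes every date string follows the `Date: Weekday, day Month year` pattern.

```python
import pandas as pd

# A tiny stand-in for the data frame read from the CSV file
demo_df = pd.DataFrame({
    "Title": ["First seminar", "Second seminar"],   # placeholder titles
    "Date": ["Date: Wednesday, 07 February 2024",
             "Date: Wednesday, 1 December 2021"],
})

# Remove the "Date: " prefix, then parse what is left as a proper date
cleaned_dates = demo_df["Date"].str.replace("Date: ", "", regex=False)
demo_df["Date"] = pd.to_datetime(cleaned_dates, format="%A, %d %B %Y")

# Store titles with the dedicated string type
demo_df["Title"] = demo_df["Title"].astype("string")

print(demo_df.dtypes)
```

Once the column is a real date, operations such as sorting by date or filtering by year work as expected, which is not the case with plain strings.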


# Part IV: Why convert the scraped data to a data frame? 

If time allows, we will play around with the data frame in pandas to learn a few other tricks.