<img src="../Images/DSC_Logo.png" style="width: 400px;">

# Data Handling in Python

This notebook introduces fundamental data handling techniques in Python, with a focus on working with qualitative data. It covers how to import data from external files and write data back to files, and how to use Python libraries for data handling. 

## 1. Reading and Writing Text Files

Up until now, we’ve worked only with data created inside our own code. But in real-world projects, you’ll mainly work with existing data and files. Let's see how to open and read text files, as well as how to save (write) new files.

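We can use the built-in `open()` function inside a `with` statement **to open a text file**. The `with` statement closes the file automatically once the indented block ends: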

In [None]:
# Open the file "Interview1.txt" (from a relative path) in read mode ("r"):
with open("../Data/Interviews_ClimateChange/Interview1.txt", "r") as f: 
    data = f.read()

print(data)

This reads the contents of the .txt file and stores them as a single string in the variable `data`.

**To save text** into a new file, we use almost the same structure, but change the mode to "w" (write):

In [None]:
new_string = "This is a new interview."

# Save the file "Interview4.txt" (to a relative path) in write mode ("w"):
with open("../Data/Interviews_ClimateChange/Interview4.txt", "w") as f:
    f.write(new_string)

Now, a file called Interview4.txt will be created (or overwritten) with the text from "new_string".
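Write mode replaces whatever the file already contained. If the goal is to add text to an existing file instead, append mode ("a") can be used. The following is a minimal sketch that works inside a temporary folder, so none of the course data is touched; the file name `notes.txt` is hypothetical.

```python
import os
import tempfile

# Work in a temporary folder so no real data is overwritten
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "notes.txt")  # hypothetical file name

    # "w" creates the file (or overwrites it if it already exists)
    with open(path, "w", encoding="utf-8") as f:
        f.write("First line\n")

    # "a" adds text to the end of the file instead of overwriting it
    with open(path, "a", encoding="utf-8") as f:
        f.write("Second line\n")

    # Iterating over the file object yields one line at a time
    with open(path, "r", encoding="utf-8") as f:
        lines = [line.strip() for line in f]

print(lines)
```

Reading line by line like this is also useful for large files, because the whole text does not have to be held in one string.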

## 2. Using Libraries for Data Handling

In this section, we introduce a few Python libraries that are useful for basic work with qualitative data. These libraries help with importing, editing, and organizing data for analysis, especially when dealing with formats like text or structured files (e.g. JSON, CSV). 

## 2.1 `json`

The `json` library helps you work with **nested (hierarchical) data** - data that contains lists, dictionaries, or other structures inside it. JSON stands for JavaScript Object Notation. It’s a common way to store and send data on the web. You’ll see it used a lot when working with **websites and APIs**.

Let’s **load** a few entries from a real-world dataset of HuffPost news articles that contains thousands of news headlines from 2012 to 2022, where each line is a separate article in JSON format:

In [None]:
# Import the json library (part of Python’s standard library)
import json

In [None]:
# Path to the dataset file (relative path from current notebook)
file_path = "../Data/sample_articles_10000.json"

# Create an empty list to store the loaded articles
news_articles = []

# Open the file in read mode ("r")
with open(file_path, "r", encoding="utf-8") as f:
    # Loop over each line in the file (each line is a separate news article in JSON format)
    for i, line in enumerate(f):
        if i >= 5:  # Stop after reading the first 5 articles (just for demonstration)
            break
        # Convert the JSON string (one line) into a Python dictionary and add it to the list
        news_articles.append(json.loads(line))

# Loop through the loaded articles and print selected fields
for article in news_articles:
    # Access and print the author's name, category, and headline using dictionary keys
    print(f"{article['authors']} ({article['category']}): {article['headline']}")

This is how we can **write** a JSON-style data structure from scratch in Python: We define a dictionary to represent a single news article, including keys like "link", "headline", "category", "authors", and "date". If we have multiple articles, we can store them in a list of dictionaries.

In [None]:
news_articles = [
    {
        "link": "https://www.huffpost.com/entry/covid-boosters-uptake-us_n_632d719ee4b087fae6feaac9",
        "headline": "Over 4 Million Americans Roll Up Sleeves For Omicron-Targeted COVID Boosters",
        "category": "U.S. NEWS",
        "short_description": "Health experts said it is too early to predict whether demand would match up...",
        "authors": "Carla K. Johnson, AP",
        "date": "2022-09-23"
    },
    # more articles could go here
]

for article in news_articles:
    print(f"{article['authors']} ({article['category']}): {article['headline']}")
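A list of dictionaries like this can also be saved back to disk in the same one-article-per-line layout as the dataset. The sketch below assumes two made-up articles and a hypothetical file name inside a temporary folder; `json.dumps` turns a dictionary into a JSON string and `json.loads` reverses that step.

```python
import json
import os
import tempfile

# Two made-up articles, used only for illustration
articles = [
    {"headline": "Example headline A", "category": "SCIENCE"},
    {"headline": "Example headline B", "category": "POLITICS"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "articles.json")  # hypothetical file name

    # Write one JSON object per line
    with open(path, "w", encoding="utf-8") as f:
        for article in articles:
            f.write(json.dumps(article) + "\n")

    # Read the file back, converting each line into a dictionary again
    with open(path, "r", encoding="utf-8") as f:
        loaded = [json.loads(line) for line in f]

# Check that the round trip preserved the data
print(loaded == articles)
```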

Here is how we can **load** only the headlines and **write** them to a single text file:

In [None]:
file_path = "../Data/sample_articles_10000.json"
output_path = "../Data/all_headlines.txt"

# Open the dataset and extract headlines
headlines = []
with open(file_path, "r", encoding="utf-8") as f:
    for line in f:
        article = json.loads(line)
        headline = article.get("headline", "").strip()
        if headline:
            headlines.append(headline)

# Write all headlines to a single text file, one per line
with open(output_path, "w", encoding="utf-8") as f_out:
    for headline in headlines:
        f_out.write(headline + "\n")

## 2.2 `glob`

Suppose you have a folder with multiple .txt files - each one is a transcript of a different interview. You want to automatically load all these files to analyze them in Python. The `glob` library allows you to search for files in a folder based on patterns. In this example, we’ll load all .txt files from a folder and print their contents.

In [None]:
# Import the glob library (part of Python’s standard library)
import glob

# Get all text files in the "Interviews_ClimateChange" folder
files = glob.glob("../Data/Interviews_ClimateChange/*.txt")

# Read and display contents of each file
for filepath in files:
    with open(filepath, "r", encoding="utf-8") as f:
        content = f.read()
        print(f"--- Contents of {filepath} ---")
        print(content)
        print("\n")
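Printing is fine for a first look, but for analysis it is usually more convenient to keep the texts in a data structure. The sketch below is one possible approach: it stores each transcript in a dictionary keyed by file name. It creates its own placeholder files in a temporary folder (the file names are hypothetical), and it sorts the paths because `glob` does not guarantee any particular order.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create a few small placeholder files for the demonstration
    for name in ["Interview2.txt", "Interview1.txt", "notes.md"]:
        with open(os.path.join(tmp_dir, name), "w", encoding="utf-8") as f:
            f.write(f"Text of {name}")

    # Match only .txt files and sort the paths for a predictable order
    files = sorted(glob.glob(os.path.join(tmp_dir, "*.txt")))

    # Store each transcript under its file name (without the folder part)
    transcripts = {}
    for filepath in files:
        with open(filepath, "r", encoding="utf-8") as f:
            transcripts[os.path.basename(filepath)] = f.read()

print(list(transcripts.keys()))
```

The file `notes.md` is skipped because it does not match the `*.txt` pattern.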

## 2.3 `beautifulsoup4` & `requests`

`beautifulsoup4` is a library used to parse HTML (HyperText Markup Language) and extract information from it. It is well suited to getting data from websites in a structured, readable way. Together with `requests`, which downloads the page, we can carry out our first web scraping task. Let's scrape some text from a Wikipedia page!

In [None]:
# Install & import the libraries
!pip install beautifulsoup4 
!pip install requests
from bs4 import BeautifulSoup
import requests

In [None]:
# Define the URL of the Wikipedia page you want to fetch
url = "https://en.wikipedia.org/wiki/Democracy"

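# Send an HTTP GET request to the URL and store the response
response = requests.get(url)

# Stop with a clear error if the request failed (e.g. status 403 or 404)
response.raise_for_status()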

# Parse the HTML content of the page using BeautifulSoup
# "html.parser" tells BeautifulSoup to interpret the content as standard HTML
soup = BeautifulSoup(response.content, "html.parser")

# Find the main content section of the Wikipedia article
# Wikipedia stores article content inside a <div> tag with the ID "mw-content-text"
body = soup.find("div", {"id": "mw-content-text"})

# Find all <p> tags (paragraphs) within that section and extract their text
# Only include paragraphs that are not empty
paragraphs = [p.text for p in body.find_all("p") if p.text.strip()]

# Join all the paragraphs into one large text block
main_text = " ".join(paragraphs) # " ".join(): joins a list of strings into one single string, with a space (" ") placed between each element

# Print the first 500 characters of the article body
print(main_text[:500])
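To see what `find` and `find_all` do without depending on a network connection, the same steps can be tried on a small HTML string. The snippet below is only a sketch: the HTML is made up for illustration, and the check for `None` is an added precaution for pages where the expected `<div>` is missing.

```python
from bs4 import BeautifulSoup

# A small, made-up HTML document that mimics the structure used above
html = """
<html><body>
  <div id="mw-content-text">
    <p>First paragraph.</p>
    <p>   </p>
    <p>Second paragraph with a <a href="#">link</a>.</p>
  </div>
  <p>Footer text outside the main section.</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns None when no matching tag exists, so check before using it
body = soup.find("div", {"id": "mw-content-text"})

if body is None:
    paragraphs = []
else:
    # Keep the text of every non-empty <p> tag inside the main section
    paragraphs = [
        p.get_text().strip() for p in body.find_all("p") if p.get_text().strip()
    ]

print(paragraphs)
```

The empty paragraph and the footer are left out: the first because it contains only whitespace, the second because it sits outside the selected `<div>`.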

## 2.4 `pandas`

`pandas` is the most widely used Python library for working with tabular data - data arranged in rows and columns - commonly found in files like .csv or Excel spreadsheets. 

`pandas` makes it easy to read files like `.csv` into `DataFrames`, which keep both the data and its structure intact. This allows you to efficiently explore, organize, and analyze data within your code. `DataFrames` are similar to Excel spreadsheets or database tables. They have a 2-dimensional data structure and labeled axes (rows and columns). These are indexed for efficient data retrieval.

<img src="../Images/dataframe.png" style="width: 300px;">

In [None]:
# Install & import pandas
!pip install pandas
import pandas as pd

Let's look at some basic commands with `pandas`:

- To **create a DataFrame** from a dictionary:

In [None]:
names_dict = {"names": ["Alice", "Mary", "Kim", "Deniz", "Carla", "Linus"]}
df = pd.DataFrame(names_dict)

Simply type `df` or use `print(df)` to display the DataFrame.

In [None]:
print(df)

- **Saving** a DataFrame to CSV:

In [None]:
df.to_csv("../Data/names.csv", index=False)

Setting `index=False` prevents pandas from writing row indices to the file.

- **Reading** a DataFrame from CSV:

In [None]:
Titanic = pd.read_csv("../Data/Titanic-Dataset.csv")

- **Quick inspection** - getting to know the DataFrame:

In [None]:
Titanic.head() # Preview (default: 5 rows)

In [None]:
Titanic.head(10) # Preview 10 rows

In [None]:
Titanic.describe() # Summarize the statistical properties of numerical data

- **Adding a new column** to a DataFrame:

In [None]:
df["ages"] = [20, 26, 20, 18, 52, 40]

Ensure the new column has the same number of entries as existing rows.
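As a quick illustration of that rule, the sketch below uses a small made-up DataFrame called `people` and tries to add a list that is one value short. pandas raises a `ValueError` in that case and leaves the DataFrame unchanged.

```python
import pandas as pd

# A small, made-up DataFrame with three rows
people = pd.DataFrame({"names": ["Alice", "Mary", "Kim"]})

try:
    # Only two values for three rows: pandas rejects this
    people["ages"] = [20, 26]
    error_raised = False
except ValueError as err:
    error_raised = True
    print("pandas refused the column:", err)

# With one value per row, the assignment works
people["ages"] = [20, 26, 20]
print(people)
```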

- **Accessing** a specific **column**:

In [None]:
ages = df["ages"]
# or:
ages = df.ages
print(ages)

- **Converting** a column to a **list**:

In [None]:
df["ages"].tolist()

- Isolating **unique values** in a column:

In [None]:
df["ages"].unique()

- **Accessing** a specific **row** by position with `iloc` (counting starts at 0, so `1` is the second row):

In [None]:
df.iloc[1]

- **Sort by** one or more columns:

In [None]:
df.sort_values("ages", ascending=False) # ascending=False tells pandas to sort values from highest to lowest instead of the default (lowest to highest)

- **Reset index**:

In [None]:
df.reset_index(drop=True)
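Both `sort_values()` and `reset_index()` return a new DataFrame and leave the original untouched, so the result has to be assigned to a variable to be kept. A minimal sketch with a made-up DataFrame called `people`:

```python
import pandas as pd

people = pd.DataFrame({"names": ["Alice", "Mary", "Kim"], "ages": [20, 26, 18]})

# Sorting returns a new DataFrame; the rows keep their original index labels
sorted_people = people.sort_values("ages", ascending=False)
print(sorted_people.index.tolist())

# reset_index(drop=True) renumbers the rows from 0 and discards the old labels
sorted_people = sorted_people.reset_index(drop=True)
print(sorted_people)

# The original DataFrame is still in its initial order
print(people["names"].tolist())
```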

### **Exercise 1:** 

Load the World Happiness Report 2024 from `../Data/World-happiness-report-2024.csv` using `pandas`. Sort the data table by "Ladder score" and print the top 5 countries with the highest "Ladder score".

In [None]:
import pandas as pd

# Load the dataset
df2 = pd.read_csv("../Data/World-happiness-report-2024.csv")

# Sort by "Ladder score" from highest to lowest and keep the sorted result
df2 = df2.sort_values("Ladder score", ascending=False)

# Show the 5 countries with the highest "Ladder score"
df2[["Country name", "Ladder score"]].head(5)

- **Iterating** over a DataFrame with `iterrows()`:

In [None]:
for idx, row in df.iterrows():
    print(row["names"], row["ages"])

- **Select rows** by conditions:

In [None]:
df.loc[df["ages"] <= 40]
# or, with the same condition written as a query string:
df.query("ages <= 40")

Add more filters by chaining conditions with `&` (and) or `|` (or). Wrap each condition in parentheses when combining them this way.
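A short sketch of combined conditions, using a made-up DataFrame called `people`. Each condition sits in its own parentheses because `&` and `|` bind more tightly than comparisons such as `>=`.

```python
import pandas as pd

people = pd.DataFrame({
    "names": ["Alice", "Mary", "Kim", "Deniz", "Carla", "Linus"],
    "ages": [20, 26, 20, 18, 52, 40],
})

# Both conditions must hold: ages from 20 to 40 (inclusive)
between = people[(people["ages"] >= 20) & (people["ages"] <= 40)]

# At least one condition must hold: younger than 20 or older than 50
extremes = people[(people["ages"] < 20) | (people["ages"] > 50)]

print(between["names"].tolist())
print(extremes["names"].tolist())
```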

- **Cleaning** the DataFrame:

Remove a column (this returns a new DataFrame; `df` itself stays unchanged unless you assign the result back):

In [None]:
df.drop(columns=["ages"])

Keep only a certain column:

In [None]:
df[["names"]]
# or
df.filter(["names"])

- Column **name pattern filtering**, for example:

In [None]:
df.filter(regex="^a")  # selects all columns starting with "a"

- Remove rows with **missing values**:

From a specific column:

In [None]:
df[df["ages"].notna()] 

Rows in which every column is empty (`how="all"`). The default, `how="any"`, removes a row as soon as a single value is missing:

In [None]:
df.dropna(how="all")
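The difference between the two settings is easiest to see on data that actually has gaps. The sketch below uses a made-up DataFrame called `people` in which one row is partly empty and one row is completely empty.

```python
import pandas as pd

people = pd.DataFrame({
    "names": ["Alice", "Mary", None],
    "ages": [20, None, None],
})

# Default (how="any"): drop every row that has at least one missing value
any_dropped = people.dropna()

# how="all": drop only rows in which every value is missing
all_dropped = people.dropna(how="all")

print(len(any_dropped), len(all_dropped))
```

With the default, only Alice's row remains. With `how="all"`, Mary's row is kept as well because her name is present.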

- **Convert data type** if needed (e.g. float to int):

In [None]:
df["ages"] = df["ages"].astype(int)

- **Searching text data**:

In [None]:
df[df["names"].str.contains("A")]
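By default, `str.contains()` is case-sensitive, so searching for "A" finds only names with a capital A. The sketch below, using a made-up DataFrame called `people`, compares this with a case-insensitive search via the `case=False` parameter.

```python
import pandas as pd

people = pd.DataFrame(
    {"names": ["Alice", "Mary", "Kim", "Deniz", "Carla", "Linus"]}
)

# Case-sensitive (default): matches only a capital "A"
case_sensitive = people[people["names"].str.contains("A")]

# Case-insensitive: matches "a" and "A"
case_insensitive = people[people["names"].str.contains("a", case=False)]

print(case_sensitive["names"].tolist())
print(case_insensitive["names"].tolist())
```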

- **Grouping data by** one or more columns for aggregate analysis:

In [None]:
df.groupby("ages").count()

Chain with `.sum()`, `.mean()`, `.count()` etc., for numeric summaries.
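Grouping is most informative when the grouping column has repeated values. The sketch below assumes a small made-up DataFrame called `survey` with a `city` and an `age` column, and computes the mean age and the number of respondents per city.

```python
import pandas as pd

survey = pd.DataFrame({
    "city": ["Berlin", "Berlin", "Hamburg", "Hamburg", "Hamburg"],
    "age": [20, 30, 40, 50, 60],
})

# Select the column to summarize after grouping, then pick the aggregation
mean_age = survey.groupby("city")["age"].mean()
counts = survey.groupby("city")["age"].count()

print(mean_age)
print(counts)
```

Each result is a Series indexed by the group labels, so a single value can be looked up with, for example, `mean_age["Berlin"]`.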