## Week 5: File Reading, Writing, and Working with CSV Data Using `pandas`

- **Reading from text files** using `open()`, `.read()`, and `.readlines()` to understand file handling basics
- **Writing to text files** using `"w"` and `"a"` modes
- **Intro to CSV files**: what they are and why they’re essential in data analytics
- Using the `pandas` library to:
  - Load CSV data using `pd.read_csv()`
  - Explore and inspect data with `.head()`, `.info()`, and `.describe()`
  - Filter, compute, and manipulate columns (e.g., calculate average score)
  - Export results to a new CSV file using `to_csv()`
- Converting DataFrame columns to Python dictionaries with `.to_dict()`
- **Practice**:
  - Read a list of names and scores from a CSV using `pandas`
  - Calculate the average score
  - Filter students below average
  - Save the results to a new CSV file
  - Convert the filtered data to a dictionary

## Resources

- [Pandas Documentation](https://pandas.pydata.org/docs/)
- [Pandas Cheat Sheet (PDF)](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)
- [Real Python: Data Wrangling with Pandas](https://realpython.com/pandas-python-explore-dataset/)



## File Basics in Python: `open()`, `.read()`, `.readlines()`

For data analytics, it's important to understand how Python reads and writes text files.

### 🔸 Reading from Text Files

To read a file, use the built-in `open()` function with `"r"` (read) mode.

```python
with open("example.txt", "r") as file:
    content = file.read()
    print(content)
```
**OR**
```py
with open("example.txt", "r") as file:
    lines = file.readlines()
    print(lines)
```
**`.read()` vs `.readlines()`**

- `.read()` → Reads the entire file as a single string.
- `.readlines()` → Reads the entire file into a list of strings, one per line (each string keeps its trailing `\n`).
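The difference can be sketched with a small throwaway file. The file name `demo.txt` and its contents are made up for illustration; the file is created in a temporary directory so the example runs on its own.

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained.
# The file name and contents are hypothetical.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(path, "w", encoding="utf-8") as file:
    file.write("first line\nsecond line\nthird line\n")

# .read() returns the whole file as one string.
with open(path, "r", encoding="utf-8") as file:
    content = file.read()

# .readlines() returns a list with one string per line (newlines kept).
with open(path, "r", encoding="utf-8") as file:
    lines = file.readlines()

# Stripping the trailing "\n" is a common follow-up step.
clean_lines = [line.rstrip("\n") for line in lines]

print(repr(content))
print(lines)
print(clean_lines)
```

`.read()` suits cases where you want the text as a whole, while `.readlines()` is handier when each line is a separate record.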



In [1]:
with open("story.txt", encoding="utf-8") as file:
    data = file.read()
data

'💰 A Story About CPF: The Three Jars\nOnce upon a time in Singapore, there was a young man named Alex who just landed his first full-time job. On his first payday, he was excited — until he noticed a portion of his salary was taken out.\n"Where did my money go?" he wondered.\nHis HR manager smiled and said, “That’s your CPF contribution. Think of it like this — we’re giving your future a head start.”\n🏠 The Ordinary Wages and the Three Jars\nThe money deducted went into three special jars — each with a purpose.\n1. 🧓 Jar 1: Special Account (SA) – This was for Alex’s retirement savings. The money would grow slowly over time, like a tree planted early.\n2. 🏡 Jar 2: Ordinary Account (OA) – This was for housing, education, or investment. Someday, Alex could use it to help buy his first flat.\n3. ❤️ Jar 3: Medisave Account (MA) – This was for medical expenses. If Alex or his family needed to see a doctor or pay for insurance, this jar had him covered.\nAs Alex grew older, he realized CPF wa

In [2]:
with open("story.txt", encoding="utf-8") as file:
    data = file.readlines()
print(data)

['💰 A Story About CPF: The Three Jars\n', 'Once upon a time in Singapore, there was a young man named Alex who just landed his first full-time job. On his first payday, he was excited — until he noticed a portion of his salary was taken out.\n', '"Where did my money go?" he wondered.\n', 'His HR manager smiled and said, “That’s your CPF contribution. Think of it like this — we’re giving your future a head start.”\n', '🏠 The Ordinary Wages and the Three Jars\n', 'The money deducted went into three special jars — each with a purpose.\n', '1. 🧓 Jar 1: Special Account (SA) – This was for Alex’s retirement savings. The money would grow slowly over time, like a tree planted early.\n', '2. 🏡 Jar 2: Ordinary Account (OA) – This was for housing, education, or investment. Someday, Alex could use it to help buy his first flat.\n', '3. ❤️ Jar 3: Medisave Account (MA) – This was for medical expenses. If Alex or his family needed to see a doctor or pay for insurance, this jar had him covered.\n', 'A

---

## 📝 File Basics in Python: `write()`

The `write()` method writes a string to a file. It is available on the file object returned by `open()`, and it lets you create a file or replace its contents.


In [6]:
with open("filename.txt", "w") as file:
    file.write("Hello, world!")

- `"w"` stands for write mode. If the file exists, its existing content is overwritten. If it doesn't, a new file is created.

- Always use `with open(...)` so the file is properly closed after writing.
- In text mode, `write()` only accepts strings. If you want to write numbers or other types, convert them first:

In [5]:
score = 95
with open("filename.txt", "w") as file:
    file.write(str(score))
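When writing several values, keep in mind that `write()` does not add a newline for you. A minimal sketch, assuming a made-up list of scores and a hypothetical file name `scores.txt` in a temporary directory:

```python
import os
import tempfile

# Hypothetical file name and made-up values, used only for illustration.
path = os.path.join(tempfile.mkdtemp(), "scores.txt")
scores = [95, 82, 67]

with open(path, "w", encoding="utf-8") as file:
    for score in scores:
        # The f-string converts the number to text; "\n" ends the line.
        file.write(f"{score}\n")

# Read the file back to check what was written.
with open(path, "r", encoding="utf-8") as file:
    written = file.read()

print(repr(written))
```

Without the explicit `"\n"`, the three numbers would run together on one line.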

---

## ➕ File Basics in Python: Append (`"a"`) Mode

The `"a"` mode in Python is used to **append data to the end of a file**. It is part of the file handling system and is useful when you want to add new content **without deleting existing content**.



In [7]:
with open("filename.txt", "a") as file:
    file.write("This new line will be added at the end.\n")

- `"a"` stands for append mode.

- If the file does not exist, it will be created.

- If the file does exist, new content will be added at the end without overwriting the original data.
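The contrast between `"a"` and `"w"` can be sketched side by side. The file name `log.txt` is hypothetical and the file lives in a temporary directory:

```python
import os
import tempfile

# Hypothetical file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "log.txt")

# "w" creates the file (or empties it) before writing.
with open(path, "w", encoding="utf-8") as file:
    file.write("first\n")

# "a" keeps the existing content and adds to the end.
with open(path, "a", encoding="utf-8") as file:
    file.write("second\n")

with open(path, "r", encoding="utf-8") as file:
    after_append = file.read()

# Opening with "w" again discards everything written so far.
with open(path, "w", encoding="utf-8") as file:
    file.write("third\n")

with open(path, "r", encoding="utf-8") as file:
    after_overwrite = file.read()

print(repr(after_append))
print(repr(after_overwrite))
```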

---

## 📥 Loading CSV Files with `pandas.read_csv()`

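A CSV (comma-separated values) file is plain text in which each line is a row and commas separate the columns. The `read_csv()` function in `pandas` is the most common way to **load data from a CSV file** into a DataFrame, which is like an Excel table in Python.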


In [None]:
# %pip install pandas
import pandas
data = pandas.read_csv("students_score.csv")

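To see what `read_csv()` does without needing a file on disk, here is a minimal sketch that parses CSV text held in a string. The rows are made up for illustration and only mimic the column layout of `students_score.csv`; the `pd` alias is the common convention, whereas the notebook cell imports the library as `pandas`.

```python
import io

import pandas as pd

# Made-up CSV text: the first line is the header, each later line is one row.
csv_text = """name,gender,subject,score
Ana,female,Math,81
Ben,male,English,64
Cara,female,Science,90
"""

# read_csv accepts a file path or a file-like object such as StringIO.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)          # (rows, columns)
print(list(df.columns))  # column names taken from the header line
print(df)
```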


## 🔍 Exploring DataFrames in `pandas`: `.head()`, `.info()`, `.describe()`

Once you've read in your CSV using `pandas.read_csv()` (often written `pd.read_csv()` when the library is imported as `pd`), use these methods to **quickly explore your data**:


**`.head()`**

In [9]:
data.head()

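|   | name | gender | subject | score |
|---|------|--------|---------|-------|
| 0 | George | male | Math | 55 |
| 1 | Ian | male | English | 64 |
| 2 | Charlie | male | English | 53 |
| 3 | Fiona | female | Science | 60 |
| 4 | Ethan | male | English | 98 |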


- Shows the first 5 rows of the DataFrame.

- Helps you confirm the data loaded correctly.

- Optional: pass a number, e.g. `data.head(10)`, to see the first 10 rows.

**`.info()`**

In [10]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   name     50 non-null     object
 1   gender   50 non-null     object
 2   subject  50 non-null     object
 3   score    50 non-null     int64 
dtypes: int64(1), object(3)
memory usage: 1.7+ KB


- Gives a summary of the DataFrame:
  - Number of rows and columns
  - Column names and data types
  - Number of non-null values
- Useful for checking for missing data and data types before cleaning.

**`.describe()`**

In [11]:
data.describe()

|       | score |
|-------|-------|
| count | 50.0 |
| mean | 75.58 |
| std | 15.950293 |
| min | 53.0 |
| 25% | 58.25 |
| 50% | 77.5 |
| 75% | 89.75 |
| max | 100.0 |
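The figures that `.describe()` reports can also be computed one at a time by selecting a column first. A minimal sketch on a tiny made-up DataFrame (the names and scores are hypothetical, not taken from `students_score.csv`):

```python
import pandas as pd

# Tiny made-up dataset used only for illustration.
df = pd.DataFrame(
    {
        "name": ["Ana", "Ben", "Cara", "Dev"],
        "score": [81, 64, 90, 73],
    }
)

# Selecting one column gives a Series, which has its own summary methods.
average = df["score"].mean()
highest = df["score"].max()
lowest = df["score"].min()
count = df["score"].count()  # number of non-null values

print(average, highest, lowest, count)
```

This is handy when you only need a single number, such as the average score, rather than the full summary table.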


---

## Filtering Rows in a `pandas` DataFrame

Filtering lets you select specific rows that match a condition, such as “scores above 80” or “name is Alice”.

### 🔹 Basic Syntax:

```python
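# Pattern: put a boolean condition on a column inside the brackets,
# for example: df["column_name"] > value
filtered_df = df[df["column_name"] > value]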
```

In [15]:
filtered_df = data[data["score"] > 70]
# Try printing it out to see what you get.
# filtered_df.head()

```py
df["Score"] > 80
```
→ Returns a Series of booleans like this:

0     True  
1    False  
2     True  
3    False  
...  
**Wrapping that condition in `df[...]` tells pandas to return only the rows of `df` where the condition is `True`.**


## ❌ Common Mistake:

### This runs, but it does not filter anything:

```python
df["Score"] > 80
```
- It only gives you the condition, not the filtered DataFrame.
- You're just generating a Series of `True` and `False` values.

## ✅ Correct:
```py
df[ df["Score"] > 80 ]
```
It gives you a new, filtered DataFrame containing only the rows you want: those where `Score` is greater than 80.
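Both steps, building the boolean Series and then using it to select rows, can be sketched on a tiny made-up DataFrame. The names and scores are hypothetical; the last part filters students below the average, as in the practice task.

```python
import pandas as pd

# Tiny made-up dataset used only for illustration.
df = pd.DataFrame(
    {
        "name": ["Ana", "Ben", "Cara", "Dev"],
        "score": [81, 64, 90, 73],
    }
)

# Step 1: the condition alone is a Series of True/False values.
mask = df["score"] > 80

# Step 2: passing that Series to df[...] keeps only the True rows.
high = df[mask]

# The same pattern works with a computed threshold, such as the average.
average = df["score"].mean()
below_average = df[df["score"] < average]

print(mask.tolist())
print(high)
print(below_average)
```

Note that the filtered DataFrames keep the original row labels, which is why the index of a filtered result may no longer start at 0.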

## Hands on Quiz

1) Read the CSV into a DataFrame.

2) Print the first 3 rows.

3) Print the summary statistics using `.describe()`.

4) Filter the DataFrame to include only students who scored above 70.

5) Find the new mean and the number of rows in the filtered DataFrame.

6) Write the filtered DataFrame to a new file called `high_scores.csv`.

In [13]:
data = pandas.read_csv("students_score.csv")
data.head(3)

|   | name | gender | subject | score |
|---|------|--------|---------|-------|
| 0 | George | male | Math | 55 |
| 1 | Ian | male | English | 64 |
| 2 | Charlie | male | English | 53 |


In [33]:
data.describe()

|       | score |
|-------|-------|
| count | 50.0 |
| mean | 75.58 |
| std | 15.950293 |
| min | 53.0 |
| 25% | 58.25 |
| 50% | 77.5 |
| 75% | 89.75 |
| max | 100.0 |


In [34]:
data = data[data['score'] > 70]

In [35]:
data.describe()

|       | score |
|-------|-------|
| count | 29.0 |
| mean | 87.551724 |
| std | 8.415913 |
| min | 73.0 |
| 25% | 80.0 |
| 50% | 89.0 |
| 75% | 93.0 |
| max | 100.0 |


In [36]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 29 entries, 4 to 47
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   name     29 non-null     object
 1   gender   29 non-null     object
 2   subject  29 non-null     object
 3   score    29 non-null     int64 
dtypes: int64(1), object(3)
memory usage: 1.1+ KB
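The last quiz step, writing the filtered DataFrame to a CSV file, together with the `.to_dict()` conversion from the week's outline, might look like the sketch below. It uses a tiny made-up DataFrame rather than `students_score.csv`, and it writes `high_scores.csv` into a temporary directory so nothing in your working folder is touched.

```python
import os
import tempfile

import pandas as pd

# Tiny made-up dataset used only for illustration.
df = pd.DataFrame(
    {
        "name": ["Ana", "Ben", "Cara", "Dev"],
        "score": [81, 64, 90, 73],
    }
)
high_scores = df[df["score"] > 70]

# index=False leaves out the row labels, so the file holds only the data columns.
out_path = os.path.join(tempfile.mkdtemp(), "high_scores.csv")
high_scores.to_csv(out_path, index=False)

# Reading the file back is a quick way to confirm what was saved.
reloaded = pd.read_csv(out_path)

# One dictionary per row.
records = high_scores.to_dict(orient="records")

# One dictionary mapping each name to its score.
name_to_score = high_scores.set_index("name")["score"].to_dict()

print(reloaded)
print(records)
print(name_to_score)
```

Without `index=False`, the row labels are written as an extra unnamed first column, which then shows up as `Unnamed: 0` when the file is loaded again.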
