# **Activity 2: Python-Pandas Exercise**

Objectives:
- Understand Python syntax (variables, loops, functions).
- Learn Pandas basics (Series, DataFrames, reading files).
- Perform data cleaning (handling missing values, correcting formats, removing duplicates).
- Apply concepts in a real-world case study.

# Part 1: Hands-on Python & Pandas Basics

1. Install the Pandas library in your environment.

In [2]:
pip install pandas

Defaulting to user installation because normal site-packages is not writeable

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m25.0[0m[39;49m -> [0m[32;49m25.0.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49m/Library/Developer/CommandLineTools/usr/bin/python3 -m pip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


2. Import the pandas package under the name `pd`.

In [3]:
import pandas as pd

3. Print the pandas version

In [4]:
print(pd.__version__)

2.2.3


4. Create a variable `x` with the value 10 and a string variable `y` with "Fortes in Fide!"

In [5]:
# The task asks for two plain variables, so no DataFrame is needed here
x = 10
y = "Fortes in Fide!"

print(x)
print(y)

10
Fortes in Fide!


5. Define a list with numbers `[1, 2, 3, 4, 5]` and a dictionary with keys `name` and `age`

In [6]:
num_list = [1, 2, 3, 4, 5]

data = {"name": ["Ivan", "Mary Vee", "Erika"], "age": [22, 21, 21]}

df = pd.DataFrame(data)
print(num_list)
print(df)

[1, 2, 3, 4, 5]
       name  age
0      Ivan   22
1  Mary Vee   21
2     Erika   21


6. Write a function `greet(name)` that returns "Magis, (name)!"

In [7]:
def greet(name):
    # Return the greeting instead of printing it, as the task asks
    return f"Magis, {name}!"

print(greet("Mary Vee"))

Magis, Mary Vee!


7. Write a Python function that takes a user’s name as input and prints a personalized greeting.

In [8]:
name = input("What is your name? ").strip()

def greet2(name):
    return f"{name}, Praise be Jesus and Mary!"

print(greet2(name))

Erika, Praise be Jesus and Mary!


8. Modify **Number 7** so that if the user does not enter a name, the greeting defaults to "Guest".

In [9]:
name = input("Input name: ").strip()

def greet2(name):
    # An empty string is falsy, so fall back to "Guest" when nothing was entered
    if not name:
        name = "Guest"
    return f"{name}, Praise be Jesus and Mary!"

print(greet2(name))

Guest, Praise be Jesus and Mary!


9. Create a Pandas Series from `[10, 20, 30, 40]`.

In [10]:
series = [10, 20, 30, 40]

mySeries = pd.Series(series)

print(mySeries)

0    10
1    20
2    30
3    40
dtype: int64


10. Create a DataFrame with columns `A` and `B`.

In [11]:
data = {
    "A" : [10,20,30],
    "B" : [5,6,7]
}

df = pd.DataFrame(data)

print(df)

    A  B
0  10  5
1  20  6
2  30  7


# Part 2: Working with a Dataset 🛥️

1. Load the Titanic dataset from a local file and display the first five rows.

In [12]:
tit = pd.read_csv('titanic_dataset.csv')

print(tit.head(5))

   PassengerId  Survived  Pclass  \
0          892         0       3   
1          893         1       3   
2          894         0       2   
3          895         0       3   
4          896         1       3   

                                           Name     Sex   Age  SibSp  Parch  \
0                              Kelly, Mr. James    male  34.5      0      0   
1              Wilkes, Mrs. James (Ellen Needs)  female  47.0      1      0   
2                     Myles, Mr. Thomas Francis    male  62.0      0      0   
3                              Wirz, Mr. Albert    male  27.0      0      0   
4  Hirvonen, Mrs. Alexander (Helga E Lindqvist)  female  22.0      1      1   

    Ticket     Fare Cabin Embarked  
0   330911   7.8292   NaN        Q  
1   363272   7.0000   NaN        S  
2   240276   9.6875   NaN        Q  
3   315154   8.6625   NaN        S  
4  3101298  12.2875   NaN        S  


2. Display the dataset's column names and data types.

In [13]:
# info() prints its report directly and returns None, so print() is not needed
tit.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Survived     418 non-null    int64  
 2   Pclass       418 non-null    int64  
 3   Name         418 non-null    object 
 4   Sex          418 non-null    object 
 5   Age          332 non-null    float64
 6   SibSp        418 non-null    int64  
 7   Parch        418 non-null    int64  
 8   Ticket       418 non-null    object 
 9   Fare         417 non-null    float64
 10  Cabin        91 non-null     object 
 11  Embarked     418 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 39.3+ KB


3. Display the dataset's missing values.

In [14]:
print(tit.isna().sum())

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64


4. Display the `Name`, `Age`, and `Fare` columns from the dataset. (first 10)

In [15]:
print(tit[['Name', 'Age', 'Fare']].head(10))

                                           Name   Age     Fare
0                              Kelly, Mr. James  34.5   7.8292
1              Wilkes, Mrs. James (Ellen Needs)  47.0   7.0000
2                     Myles, Mr. Thomas Francis  62.0   9.6875
3                              Wirz, Mr. Albert  27.0   8.6625
4  Hirvonen, Mrs. Alexander (Helga E Lindqvist)  22.0  12.2875
5                    Svensson, Mr. Johan Cervin  14.0   9.2250
6                          Connolly, Miss. Kate  30.0   7.6292
7                  Caldwell, Mr. Albert Francis  26.0  29.0000
8     Abrahim, Mrs. Joseph (Sophie Halaut Easu)  18.0   7.2292
9                       Davies, Mr. John Samuel  21.0  24.1500


5. Print the descriptive statistics of the Titanic dataset.

In [16]:
# tit is already a DataFrame; copy it so later changes leave the original untouched
df = tit.copy()

print(df.describe())

       PassengerId    Survived      Pclass         Age       SibSp  \
count   418.000000  418.000000  418.000000  332.000000  418.000000   
mean   1100.500000    0.363636    2.265550   30.272590    0.447368   
std     120.810458    0.481622    0.841838   14.181209    0.896760   
min     892.000000    0.000000    1.000000    0.170000    0.000000   
25%     996.250000    0.000000    1.000000   21.000000    0.000000   
50%    1100.500000    0.000000    3.000000   27.000000    0.000000   
75%    1204.750000    1.000000    3.000000   39.000000    1.000000   
max    1309.000000    1.000000    3.000000   76.000000    8.000000   

            Parch        Fare  
count  418.000000  417.000000  
mean     0.392344   35.627188  
std      0.981429   55.907576  
min      0.000000    0.000000  
25%      0.000000    7.895800  
50%      0.000000   14.454200  
75%      0.000000   31.500000  
max      9.000000  512.329200  


6. Remove rows with missing values in the `Age` column.

In [17]:
missing_age = df.dropna(subset=['Age'])

print(missing_age.to_string())

     PassengerId  Survived  Pclass                                                             Name     Sex    Age  SibSp  Parch              Ticket      Fare            Cabin Embarked
0            892         0       3                                                 Kelly, Mr. James    male  34.50      0      0              330911    7.8292              NaN        Q
1            893         1       3                                 Wilkes, Mrs. James (Ellen Needs)  female  47.00      1      0              363272    7.0000              NaN        S
2            894         0       2                                        Myles, Mr. Thomas Francis    male  62.00      0      0              240276    9.6875              NaN        Q
3            895         0       3                                                 Wirz, Mr. Albert    male  27.00      0      0              315154    8.6625              NaN        S
4            896         1       3                     Hirvonen, Mrs. Alexa
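
As a smaller illustration of the same step, the sketch below applies `dropna(subset=...)` to a made-up three-row frame. The `toy` data is hypothetical and not taken from the Titanic file. It shows that only rows missing `Age` are dropped, while missing `Cabin` values are left alone.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not from the Titanic file
toy = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Age": [34.5, np.nan, 62.0],
    "Cabin": [np.nan, "C85", np.nan],
})

# Drop only the rows where Age is missing; missing Cabin values are kept
with_age = toy.dropna(subset=["Age"])

print(len(toy), "rows before,", len(with_age), "rows after")
print(with_age)
```

Comparing the row counts before and after is usually easier to read than printing the whole frame with `to_string()`.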

7. Remove duplicate rows from the dataset.

In [18]:
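print(df.duplicated())

# duplicated() only flags repeated rows; drop_duplicates() actually removes them
df = df.drop_duplicates()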

0      False
1      False
2      False
3      False
4      False
       ...  
413    False
414    False
415    False
416    False
417    False
Length: 418, dtype: bool
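
Flagging duplicates and removing them are separate calls. A minimal sketch on made-up data (the `toy` frame is hypothetical) that counts the flagged rows and then drops them:

```python
import pandas as pd

# Hypothetical toy data with one repeated row
toy = pd.DataFrame({
    "PassengerId": [1, 2, 2, 3],
    "Fare": [7.25, 8.05, 8.05, 13.0],
})

# duplicated() marks each row that repeats an earlier one
mask = toy.duplicated()
print("duplicate rows found:", mask.sum())

# drop_duplicates() returns a new frame that keeps the first occurrence
deduped = toy.drop_duplicates().reset_index(drop=True)
print(deduped)
```

Summing the boolean mask gives a single count, which is quicker to check than scanning one `True`/`False` flag per row.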


8. Compute and display the correlation matrix of the dataset.

In [19]:
print(df.corr())

ValueError: could not convert string to float: 'Kelly, Mr. James'

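In pandas 2.x, `corr()` no longer skips non-numeric columns by default, so text columns such as `Name` raise this error. Pass `numeric_only=True`, or select the numeric columns first, to compute the matrix.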
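
A minimal sketch of two ways to compute the matrix when text columns are present, using made-up data (the `toy` frame is hypothetical, not the Titanic file):

```python
import pandas as pd

# Hypothetical toy data mixing a text column with numeric columns
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Age": [20.0, 30.0, 40.0, 50.0],
    "Fare": [10.0, 20.0, 30.0, 40.0],
    "Pclass": [3, 3, 2, 1],
})

# Option 1: ask corr() to skip non-numeric columns
corr_a = toy.corr(numeric_only=True)

# Option 2: select the numeric columns explicitly, then correlate
corr_b = toy.select_dtypes(include="number").corr()

print(corr_a)
```

Both options leave `Name` out of the result; the second makes the column selection visible in the code.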

# Part 3: Working with Case Studies

When working on these case studies, **always ensure that your code is properly documented and clearly presented**. Follow these key principles:  

### **1. Always Show Your Code**  
- Every step of data exploration, cleaning, and analysis should include **visible code outputs**.  
- Do not skip showing your process, as transparency is essential for reproducibility.  

### **2. Proper Documentation is Necessary**  
- Use **comments (`#`) in Python** to explain your code clearly.  
- Add **Markdown cells** to describe each step before executing the code.  
- Explain key findings in simple language to make the analysis easy to understand.  

### **3. Use Readable and Organized Code**  
- Follow a **step-by-step approach** to keep the notebook structured.  
- Use **proper variable names** and avoid hardcoding values where possible.
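
As a rough illustration of these principles, the sketch below combines a named constant, a descriptive function name, a docstring, and comments. The `count_missing` helper, the `AGE_COLUMN` constant, and the toy data are hypothetical and not part of the activity's datasets.

```python
import pandas as pd

# Named constant instead of a column name repeated throughout the code
AGE_COLUMN = "Age"


def count_missing(frame, column):
    """Return how many values are missing in one column of a DataFrame."""
    return int(frame[column].isna().sum())


# Hypothetical toy data used only to exercise the function
passengers = pd.DataFrame({AGE_COLUMN: [22.0, None, 35.0, None]})

missing_ages = count_missing(passengers, AGE_COLUMN)
print(f"Missing values in {AGE_COLUMN}: {missing_ages}")
```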

# **Case Study 1: Iris Flower Classification** 🌸  

### **Background**  
A botanical research institute wants to develop an automated system that classifies different species of **iris flowers** based on their **sepal and petal measurements**.  The dataset consists of **150 samples**, labeled as **Setosa, Versicolor, or Virginica**.  

### **Problem Statement**  
Can we use **sepal and petal dimensions** to correctly classify the **species of an iris flower**?  

### **Task Description**  

#### **1. Data Exploration**  
- Load the dataset and display the first few rows.  
- Identify any missing or inconsistent values.  

#### **2. Data Cleaning**  
- Check for missing values and handle them appropriately.  
- Convert categorical species labels into a format suitable for analysis.  

#### **3. Basic Data Analysis**  
- Find the average sepal and petal dimensions for each species.  
- Identify correlations between different flower measurements.  

#### **4. Visualization**  
- Create simple visualizations (e.g., histograms, scatter plots) to understand data distribution.  

#### **5. Insights & Interpretation**  
- Summarize key findings, such as which features best distinguish flower species.  

In [None]:
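# Part 1: Data exploration

# Load the dataset and show only the first few rows
flower = pd.read_csv('iris_dataset.csv')
print(flower.head())

# Count missing values per column instead of printing a full True/False table
print(flower.isna().sum())

# Part 2: Data cleaning
# Replace missing values with 0 (a placeholder choice; a column mean or median
# may suit measurements better)
flower = flower.fillna(0)

# Part 3: Basic data analysis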
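
Part 3 is left empty above. One possible shape for the label conversion and the per-species analysis is sketched below on a made-up six-row sample. The column names (`sepal_length`, `petal_length`, `species`) are assumptions and may not match `iris_dataset.csv`.

```python
import pandas as pd

# Hypothetical sample with assumed column names; the real file may differ
iris = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "petal_length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

# Data cleaning: encode the species labels as integer codes
iris["species_code"] = iris["species"].astype("category").cat.codes

# Basic analysis: average measurements for each species
means = iris.groupby("species")[["sepal_length", "petal_length"]].mean()
print(means)

# Correlation between the numeric measurements
measurement_corr = iris[["sepal_length", "petal_length"]].corr()
print(measurement_corr)
```

The integer codes follow the alphabetical order of the labels, so they are identifiers rather than a meaningful ranking.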

# **Case Study 2: Netflix Content Analysis** 🎬  

## **Background**  
Netflix is a leading streaming platform with a vast collection of movies and TV shows. The company wants to analyze its **content library** to understand trends in **genres, release years, and regional distribution**.  

## **Problem Statement**  
How can we use **Netflix’s dataset** to gain insights into content distribution, popular genres, and release trends over time?  

## **Task Description**  

### **1. Data Exploration**  
- Load the dataset and inspect its structure.  
- Identify key columns such as title, genre, release year, and country.  

### **2. Data Cleaning**  
- Check for missing or incorrect values in key columns.  
- Remove duplicates and format the date-related data properly.  

### **3. Basic Data Analysis**  
- Count the number of movies vs. TV shows.  
- Identify the most common genres and countries producing content.  
- Analyze the number of releases per year to observe trends.  

### **4. Insights & Interpretation**  
- Summarize key findings, such as trends in Netflix's content production over time.  


In [None]:
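# read_csv already returns a DataFrame; work on a copy to keep the raw data intact
flix = pd.read_csv('netflix_dataset.csv')
df = flix.copy()

# print(flix.head(5))
# print(flix.info())
# print(flix[['title', 'listed_in', 'release_year', 'country']])
# print(df.describe())

# Rows that have no missing values in any column
complete_rows = df.dropna()
# Boolean mask marking rows that repeat an earlier row
is_duplicate = df.duplicated()

# print(complete_rows.to_string())
# Count the duplicates instead of printing one flag per row
print(is_duplicate.sum())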
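
The remaining cleaning and analysis tasks could look roughly like the sketch below, run on a made-up four-row sample. The `type` and `date_added` columns, the date format, and the comma-separated layout of `listed_in` are assumptions about `netflix_dataset.csv`, not facts taken from it.

```python
import pandas as pd

# Hypothetical sample; `type` and `date_added` are assumed column names
shows = pd.DataFrame({
    "title": ["A", "B", "C", "C"],
    "type": ["Movie", "TV Show", "Movie", "Movie"],
    "listed_in": ["Dramas, Comedies", "Dramas", "Comedies", "Comedies"],
    "release_year": [2019, 2020, 2020, 2020],
    "date_added": ["January 1, 2020", "March 15, 2021",
                   "July 4, 2021", "July 4, 2021"],
})

# Data cleaning: remove exact duplicates and parse the date column
shows = shows.drop_duplicates().reset_index(drop=True)
shows["date_added"] = pd.to_datetime(shows["date_added"], format="%B %d, %Y")

# Movies vs. TV shows
type_counts = shows["type"].value_counts()

# Most common genres: split the comma-separated labels into one row each
genre_counts = shows["listed_in"].str.split(", ").explode().value_counts()

# Releases per year, in chronological order
per_year = shows["release_year"].value_counts().sort_index()

print(type_counts, genre_counts, per_year, sep="\n\n")
```

Splitting before counting matters because a title listed under several genres would otherwise be counted as one combined label.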
