What is a DataFrame?
-A DataFrame is like a spreadsheet or a table in a database.
-It has rows and columns.
-Each column can have a different type of data (like numbers, text, dates, etc.).
-It's a data structure used in programming (most commonly in Python, through the pandas library) to handle and analyze data.
Example:
+----------+---------+------------+
| Name     | Age     | City       |
+----------+---------+------------+
| Alice    | 25      | New York   |
| Bob      | 30      | London     |
| Charlie  | 28      | Sydney     |
+----------+---------+------------+
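
As a minimal sketch, the example table above could be built directly in pandas from a dictionary of columns. The variable name `people` is only for illustration.

```python
import pandas as pd

# Build the example table from a dict: each key is a column name,
# each list holds that column's values.
people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 28],
    "City": ["New York", "London", "Sydney"],
})

print(people)         # the whole table
print(people.shape)   # (number of rows, number of columns)
print(people.dtypes)  # each column carries its own data type
```

Here `Age` is stored as integers while `Name` and `City` hold text, which is what "each column can have a different type" means in practice.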

A DataFrame makes it super easy to:
-View and explore your data.
-Filter and sort information (e.g., only show people over 25).
-Clean messy data.
-Perform calculations or summaries (e.g., average age).
-Plot charts and graphs.
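
Two of the tasks in this list, sorting and summarizing, might look like the sketch below. It reuses the three-row example table; the names `people`, `oldest_first`, and `average_age` are illustrative.

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 28],
    "City": ["New York", "London", "Sydney"],
})

# Sort by age, oldest first
oldest_first = people.sort_values(by="Age", ascending=False)
print(oldest_first)

# Average age across all rows
average_age = people["Age"].mean()
print(average_age)
```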

Imagine you have a big Excel sheet with customer information.
A DataFrame is just that, but in code form, where you can tell the computer:
"Hey, show me everyone who lives in London and is over 25."
...and it will do it in seconds.
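
In pandas, that request could be written roughly as follows. This is a sketch on the three-row example table, not a real customer file; `people` and `london_over_25` are illustrative names.

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 28],
    "City": ["New York", "London", "Sydney"],
})

# Each condition produces a column of True/False values.
# & keeps only the rows where both conditions are True.
# The parentheses around each condition are required.
london_over_25 = people[(people["City"] == "London") & (people["Age"] > 25)]
print(london_over_25)
```

Only Bob's row matches, since he is the only person in London and he is over 25.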






In [3]:
# 1. Load data using pandas.read_csv()


import pandas as pd

# Load your CSV file into a DataFrame
df = pd.read_csv("people.csv")
# Replace "people.csv" with your actual file path or URL.

# 2. Check for missing values using .isnull()

# Check where values are missing (True marks a missing cell)
missing = df.isnull()

# Or, get a quick count of missing values in each column
missing_count = df.isnull().sum()
print(missing_count)
print("------------------------------------------------------------")


# 3. Explore the data using .head(), .info(), .describe()

# Show first 5 rows
print(df.head(5))
print("--------------------------------------------------------------")

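# Show info about columns, types, and non-null counts
# (df.info() prints its report itself and returns None, so wrapping it in print() also prints "None")
print(df.info())
print("---------------------------------------------------------------")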

# Show statistics (for numeric columns)
print(df.describe())

#  .head() = Peek at the first few rows.
#  .info() = Quick overview (column names, data types, missing values).
#  .describe() = Stats like mean, min, max, std dev, etc.










name      0
age       0
city      0
salary    0
dtype: int64
------------------------------------------------------------
      name  age         city  salary
0    Alice   29     New York   85000
1      Bob   35  Los Angeles   92000
2  Charlie   22      Chicago   55000
3    Diana   31     New York   99000
4    Ethan   27       Austin   71000
--------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   name    10 non-null     object
 1   age     10 non-null     int64 
 2   city    10 non-null     object
 3   salary  10 non-null     int64 
dtypes: int64(2), object(2)
memory usage: 452.0+ bytes
None
---------------------------------------------------------------
             age         salary
count  10.000000      10.000000
mean   30.200000   81900.000000
std     6.545567   16676.330532
min    22.000000   55000
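
The file above happens to have no missing values, so every count is 0. To see what `.isnull()` reports when there are gaps, here is a small sketch on made-up data (the `sample` frame and its values are hypothetical, for illustration only).

```python
import numpy as np
import pandas as pd

# Hypothetical data with two gaps
sample = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [29, np.nan, 22],
    "city": ["New York", None, "Chicago"],
})

print(sample.isnull())        # True wherever a value is missing
print(sample.isnull().sum())  # missing count per column

# Two common responses: drop incomplete rows, or fill the gaps
dropped = sample.dropna()
filled = sample.fillna({"age": sample["age"].mean(), "city": "Unknown"})
print(dropped)
print(filled)
```

Which response is appropriate depends on the data: dropping rows loses information, while filling with an average or a placeholder changes what later summaries mean.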

In [15]:
import pandas as pd

# Step 1: Load the dataset from your uploaded CSV
df = pd.read_csv("titanic.csv")

# Step 2: Take a look at the first few rows
print("🔍 Preview of the data:")
print(df.head())

# Step 3: Handle missing values
df['age'] = df['age'].fillna(df['age'].mean())  # Fill missing ages with average age
df['embark_town'] = df['embark_town'].fillna("Unknown")  # Fill missing embark_town

# Step 4: Rename columns for clarity
df = df.rename(columns={
    "sex": "gender",
    "embark_town": "embarkation_town"
})

# Step 5: Filter data — females who survived
females_survived = df[(df['gender'] == 'female') & (df['survived'] == 1)]

# Step 6: Sort the data by 'fare' in descending order
df_sorted = df.sort_values(by='fare', ascending=False)

# Step 7: Show the top 5 passengers who paid the most
print("\n💰 Top 5 passengers by fare:")
print(df_sorted[['name', 'gender', 'age', 'fare']].head())

# Optional: Save cleaned data to a new file
df_sorted.to_csv("titanic_cleaned.csv", index=False)


🔍 Preview of the data:
   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0

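KeyError: 'age'

The KeyError occurs because column labels are case-sensitive and the code above uses names that are not in this file. The preview shows the actual labels: Age, Sex, Survived, Fare, Name, and Embarked (there is no embark_town column).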
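
One way to avoid this kind of error is to inspect the column labels before using them, then either match them exactly or normalize them. The sketch below uses a tiny stand-in frame with made-up rows rather than the real file; only the header casing is taken from the preview.

```python
import pandas as pd

# Stand-in with the same header casing as the preview; the rows are made up
df = pd.DataFrame({
    "Name": ["Passenger A", "Passenger B"],
    "Sex": ["male", "female"],
    "Age": [22.0, None],
})

print(df.columns.tolist())  # inspect the exact labels first

try:
    df["age"]               # wrong case, so pandas cannot find the column
except KeyError as err:
    print("KeyError:", err)

# Option 1: use the labels exactly as they appear
df["Age"] = df["Age"].fillna(df["Age"].mean())

# Option 2: normalize every label to lowercase, then use lowercase names
lower = df.rename(columns=str.lower)
print(lower["age"])
```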

Mini Project
- [ ] Choose a dataset (from Kaggle or anywhere)
- [ ] Write a mini report in markdown:
  - What’s the dataset about?
  - Any surprising nulls or weird values?
  - What did you clean or change?


What’s the dataset about?
This dataset contains information about passengers aboard the RMS Titanic, the ship that tragically sank in 1912. Each row represents a passenger, and includes data such as:

- Name, age, sex, and passenger class (Name, Age, Sex, Pclass)
- Whether they survived (Survived)
- The fare they paid (Fare)
- The port where they embarked (Embarked)
- Family members onboard (SibSp, Parch)

The goal is often to analyze patterns of survival and understand which features influenced the likelihood of survival.
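
As a rough sketch of that kind of analysis, survival rates can be compared across groups with `groupby`. The rows below are made up and only borrow the Titanic column names; they are not real passenger data, and the resulting rates say nothing about the actual dataset.

```python
import pandas as pd

# Made-up rows using the Titanic column names
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Pclass": [3, 1, 3, 1, 2, 3],
    "Survived": [0, 1, 0, 1, 1, 0],
})

# Survived is 0 or 1, so its mean within a group is that group's survival rate
rate_by_sex = toy.groupby("Sex")["Survived"].mean()
rate_by_class = toy.groupby("Pclass")["Survived"].mean()
print(rate_by_sex)
print(rate_by_class)
```

The same two lines applied to the real DataFrame would give the survival rate per sex and per class.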

In [29]:
import pandas as pd

df = pd.read_csv("titanic.csv")

print(df.isnull().sum())

# The first count in the output shows missing values in three columns:
# Age: 177 passengers have no recorded age
# Cabin: 687 missing values (this column is often dropped entirely)
# Embarked: 2 missing port codes


# Fill missing ages with the average
df['Age'] = df['Age'].fillna(df['Age'].mean())
# Fill missing Embarked and Cabin values with 'Unknown'
df['Embarked'] = df['Embarked'].fillna('Unknown')
df['Cabin'] = df['Cabin'].fillna('Unknown')

# Rename columns to be more intuitive
df = df.rename(columns={"Sex": "Gender","Embarked": "embarkation_town"})


print(df.isnull().sum())  # confirm that no missing values remain


# Sort passengers by fare (descending)
df_sorted = df.sort_values(by='Fare', ascending=False)
print(df_sorted.head(3))

# Filter for female survivors as an example subset
females_survived = df[(df['Gender'] == 'female') & (df['Survived'] == 1)]
print(females_survived)


PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
PassengerId         0
Survived            0
Pclass              0
Name                0
Gender              0
Age                 0
SibSp               0
Parch               0
Ticket              0
Fare                0
Cabin               0
embarkation_town    0
dtype: int64
     PassengerId  Survived  Pclass                                Name  \
679          680         1       1  Cardeza, Mr. Thomas Drake Martinez   
258          259         1       1                    Ward, Miss. Anna   
737          738         1       1              Lesurer, Mr. Gustave J   

     Gender   Age  SibSp  Parch    Ticket      Fare        Cabin  \
679    male  36.0      0      1  PC 17755  512.3292  B51 B53 B55   
258  female  35.0      0      0  PC 17755  512.32
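
The cell above fills every gap with a mean or with "Unknown". As noted in its comments, Cabin is often dropped entirely instead, and other fill values are possible. The sketch below shows one such alternative on made-up rows; the choice of median for Age and the most common value for Embarked is an assumption for illustration, not the approach used in this notebook.

```python
import pandas as pd

# Made-up rows using the Titanic column names
toy = pd.DataFrame({
    "Age": [22.0, None, 26.0, 35.0],
    "Cabin": [None, "C85", None, None],
    "Embarked": ["S", "C", None, "S"],
})

# Drop a column that is mostly empty
cleaned = toy.drop(columns=["Cabin"])

# Fill numeric gaps with the median, which is less affected by extreme values than the mean
cleaned["Age"] = cleaned["Age"].fillna(cleaned["Age"].median())

# Fill categorical gaps with the most common value
cleaned["Embarked"] = cleaned["Embarked"].fillna(cleaned["Embarked"].mode()[0])
print(cleaned)
```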