<a href="https://colab.research.google.com/github/ZHAbotorabi/Pandas-Projects/blob/main/Pandas_DataFrames_Introduction_and_Operations.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pandas DataFrames

This section covers the following operations on DataFrames:

## 1- Create DataFrame
* pd.DataFrame(data)

## 2- Examine Data

* Basic structure: df.info()
* Descriptive statistics: df.describe()
* Column names: df.columns
* First few rows: df.head()

## 3- Indexing

* Single column: df['Column']
* Multiple columns: df[['Column1', 'Column2']]
* Specific row: df.iloc[row_index]

## 4- Slicing
* Specific rows and columns: df.iloc[start:end, start:end]

## 5- Numeric Operations
* Add a value: df['Column'] + value
* Mean of a column: df['Column'].mean()

## 6- Boolean Filtering

* Filter rows: df[df['Column'] > value]
* Combine filters: (df['Column1'] > value1) & (df['Column2'] == value2)

## 7- Handling Missing Data

* Find missing cells: df.isnull()
* Count missing values: df.isnull().sum()
* Fill missing values: df['Column'] = df['Column'].fillna(value)

## 8- Combining Filters

Example:

* df[(df['Column1'] > value1) & (df['Column2'] == value2)]

## 1. Create or Convert Data Types into a DataFrame
Create a DataFrame from scratch or turn other data structures (like dictionaries or lists) into a DataFrame.

In [3]:
import pandas as pd

# Creating a DataFrame from a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [24, 27, 22, 32],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston'],
    'Survived': [1, 0, 1, 1]
}

df = pd.DataFrame(data)

df.head()


      Name  Age         City  Survived
0    Alice   24     New York         1
1      Bob   27  Los Angeles         0
2  Charlie   22      Chicago         1
3    David   32      Houston         1
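
The dictionary above maps column names to lists of values. As a minimal sketch of the "convert other data structures" case, the same kind of table can also be built from a list of row dictionaries or a list of lists. The names `records`, `rows`, `df_from_records`, and `df_from_rows` are illustrative and not part of the original notebook.

```python
import pandas as pd

# A list of dictionaries: each dictionary is one row
records = [
    {'Name': 'Alice', 'Age': 24, 'City': 'New York', 'Survived': 1},
    {'Name': 'Bob', 'Age': 27, 'City': 'Los Angeles', 'Survived': 0},
]
df_from_records = pd.DataFrame(records)

# A list of lists: column names are passed separately
rows = [['Alice', 24, 'New York', 1], ['Bob', 27, 'Los Angeles', 0]]
df_from_rows = pd.DataFrame(rows, columns=['Name', 'Age', 'City', 'Survived'])

# Compare the two constructions
print(df_from_records.equals(df_from_rows))
```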


## 2. Examine DataFrame Data
Explore the basic structure and contents of the DataFrame.

In [10]:
# Display the first few rows
print(df.head())
print("\n----------&&----------\n")

# Get a summary of the DataFrame
print(df.info())
print("\n--------&&---------\n")

# Describe numeric columns
print(df.describe())
print("\n-----&&-------\n")

# View the column names
print(df.columns)


      Name  Age         City  Survived
0    Alice   24     New York         1
1      Bob   27  Los Angeles         0
2  Charlie   22      Chicago         1
3    David   32      Houston         1

----------&&----------

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Name      4 non-null      object
 1   Age       4 non-null      int64 
 2   City      4 non-null      object
 3   Survived  4 non-null      int64 
dtypes: int64(2), object(2)
memory usage: 256.0+ bytes
None

--------&&---------

             Age  Survived
count   4.000000      4.00
mean   26.250000      0.75
std     4.349329      0.50
min    22.000000      0.00
25%    23.500000      0.75
50%    25.500000      1.00
75%    28.250000      1.00
max    32.000000      1.00

-----&&-------

Index(['Name', 'Age', 'City', 'Survived'], dtype='object')


## 3. Indexing and Selecting Segments/Slicing of the DataFrame
Use indexing to select specific rows, columns, or subsets of the DataFrame.

In [13]:
# Selecting a specific column
print(df['Age'])
print("\n-----&&-------------------\n")

# Selecting multiple columns
print(df[['Name', 'City']])
print("\n-----&&------------------\n")

# Selecting a specific row using the index
print(df.iloc[1])  # Second row


0    24
1    27
2    22
3    32
Name: Age, dtype: int64

-----&&-------------------

      Name         City
0    Alice     New York
1      Bob  Los Angeles
2  Charlie      Chicago
3    David      Houston

-----&&------------------

Name                Bob
Age                  27
City        Los Angeles
Survived              0
Name: 1, dtype: object


## 4. Slicing Using iloc
Use .iloc for row and column slicing by position.

In [15]:
# Slice rows 1 to 2 and columns 1 to 2 (the end position is excluded)
print(df.iloc[1:3, 1:3])

print("\n-----&&------------------\n")

# Select specific rows and columns
print(df.iloc[[0, 2], [0, 3]])  # Rows 0 and 2, columns 0 and 3


   Age         City
1   27  Los Angeles
2   22      Chicago

-----&&------------------

      Name  Survived
0    Alice         1
2  Charlie         1


## 5. Numeric Operations on DataFrame
Perform numeric operations on columns or entire DataFrames.


In [17]:
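# Add 5 years to everyone's age
# Caution: this overwrites the column, so re-running the cell adds 5 again.
# The output below (+10 relative to the original ages) is consistent with two runs.
df['Age'] = df['Age'] + 5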
df.head()

      Name  Age         City  Survived
0    Alice   34     New York         1
1      Bob   37  Los Angeles         0
2  Charlie   32      Chicago         1
3    David   42      Houston         1


In [18]:
# Calculate the mean of the Age column
mean_age = df['Age'].mean()
print(f"Mean Age: {mean_age}")



Mean Age: 36.25
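
Because `df['Age'] = df['Age'] + 5` overwrites the column, every re-run of that cell adds another 5. One way to keep the original values intact is to write the result to a new DataFrame with `assign`. This is only a sketch on a small stand-in frame; the names `ages` and `older` are hypothetical.

```python
import pandas as pd

ages = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [24, 27]})

# assign returns a new DataFrame and leaves `ages` unchanged,
# so running this line repeatedly always gives the same result
older = ages.assign(Age=ages['Age'] + 5)

print(ages['Age'].tolist())   # original values are untouched
print(older['Age'].tolist())  # each age increased by 5
```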


## 6. Filtering Using Boolean Operations
Filter rows based on conditions using Boolean indexing.

In [19]:
# Filter rows where age is greater than 25
filtered_df = df[df['Age'] > 25]
filtered_df.head()

      Name  Age         City  Survived
0    Alice   34     New York         1
1      Bob   37  Los Angeles         0
2  Charlie   32      Chicago         1
3    David   42      Houston         1


In [20]:
# Filter rows for survivors who are from New York
ny_survivors = df[(df['City'] == 'New York') & (df['Survived'] == 1)]
ny_survivors.head()

    Name  Age      City  Survived
0  Alice   34  New York         1


## 7. Finding Empty Cells Using isnull()
Identify and handle missing data.

In [23]:
# Example with missing values
df.loc[1, 'City'] = None  # Introduce a missing value for demonstration

# Find missing cells
print(df.isnull())

    Name    Age   City  Survived
0  False  False  False     False
1  False  False   True     False
2  False  False  False     False
3  False  False  False     False


In [24]:
# Count missing values per column
print(df.isnull().sum())

Name        0
Age         0
City        1
Survived    0
dtype: int64


In [25]:
# Fill missing values with a default value
df['City'] = df['City'].fillna('Unknown')
print(df)

      Name  Age      City  Survived
0    Alice   34  New York         1
1      Bob   37   Unknown         0
2  Charlie   32   Chicago         1
3    David   42   Houston         1
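
Filling is one way to handle a missing value; dropping the affected rows is another. The sketch below shows both on a small stand-in frame, assuming the filled or dropped result should go to a new variable instead of changing the data in place. The names `people`, `filled`, and `dropped` are hypothetical.

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'City': ['New York', None, 'Chicago'],
})

# Count missing values per column
missing_per_column = people.isnull().sum()
print(missing_per_column)

# Option 1: replace missing values and keep every row
filled = people.assign(City=people['City'].fillna('Unknown'))

# Option 2: drop any row that contains a missing value
dropped = people.dropna()

print(filled['City'].tolist())
print(dropped['Name'].tolist())
```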


## 8. Combining Boolean Filtering
Combine multiple filters to create more advanced conditions.


In [27]:
# Filter passengers who are older than 25 and survived
combined_filter = df[(df['Age'] > 25) & (df['Survived'] == 1)]
print(combined_filter)


      Name  Age      City  Survived
0    Alice   34  New York         1
2  Charlie   32   Chicago         1
3    David   42   Houston         1


In [28]:
# Filter passengers who are from 'New York' or 'Chicago'
city_filter = df[(df['City'] == 'New York') | (df['City'] == 'Chicago')]
print(city_filter)

      Name  Age      City  Survived
0    Alice   34  New York         1
2  Charlie   32   Chicago         1
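
When several `==` comparisons on the same column are joined with `|`, `Series.isin` is a shorter way to build the same mask. The sketch below uses a stand-in frame with the same cities; the names `people`, `mask_or`, and `mask_isin` are illustrative.

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'City': ['New York', 'Unknown', 'Chicago', 'Houston'],
})

# Mask built by chaining == comparisons with |
mask_or = (people['City'] == 'New York') | (people['City'] == 'Chicago')

# Equivalent mask built with isin
mask_isin = people['City'].isin(['New York', 'Chicago'])

print(mask_or.equals(mask_isin))
print(people[mask_isin])
```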


## Some Exercises:

In [29]:
# Slicing a row

df.loc[0,:]

Name           Alice
Age               34
City        New York
Survived           1
Name: 0, dtype: object


# Slicing using iloc

### df.iloc[row_index, column_index]

* **loc**: label based selection
* **iloc**: integer position based selection

In [30]:
df.iloc[0,1]

34

In [31]:
df.iloc[0]

Name           Alice
Age               34
City        New York
Survived           1
Name: 0, dtype: object


### Note: A one-dimensional pandas object is a Series. A two-dimensional pandas object is a DataFrame.

In [32]:
# Slicing multiple columns

df.loc[:,["Name", "City"]]

      Name      City
0    Alice  New York
1      Bob   Unknown
2  Charlie   Chicago
3    David   Houston


In [33]:
# An alternative way of slicing columns

df[["Name", "Age"]]

      Name  Age
0    Alice   34
1      Bob   37
2  Charlie   32
3    David   42


In [34]:
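# Slicing rows works similarly. loc is label based and includes the end label;
# label 4 does not exist here, so only rows 2 and 3 are returned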

df.loc[2:4,:]

      Name  Age     City  Survived
2  Charlie   32  Chicago         1
3    David   42  Houston         1
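
The `df.loc[2:4,:]` call above slices by label, while `iloc` slices by position, and the two treat the end of the range differently. A minimal sketch of the difference, using a stand-in frame named `people` (a hypothetical name) with the default 0-based index:

```python
import pandas as pd

people = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'David']})

# iloc slices by position and excludes the end position
by_position = people.iloc[1:3]

# loc slices by label and includes the end label
by_label = people.loc[1:3]

print(by_position['Name'].tolist())
print(by_label['Name'].tolist())
```

With the default integer index the labels and positions coincide, which is why the inclusive versus exclusive end is the only visible difference here.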


# Numeric Operations on Series

# Descriptive Stats on series

In [37]:
df["Age"].max()

42

In [38]:
df["Age"].min()

32

In [39]:
df["Age"].mean()

36.25

In [None]:
df["Age"].median()

35.5

In [40]:
df["Age"].mode()

0    32
1    34
2    37
3    42
Name: Age, dtype: int64


In [41]:
df["Age"].sum()

145

In [43]:
df["Age"]

0    34
1    37
2    32
3    42
Name: Age, dtype: int64


In [44]:
df["Age"].value_counts()

Age
34    1
37    1
32    1
42    1
Name: count, dtype: int64


In [45]:
df["Age"].describe()

count     4.000000
mean     36.250000
std       4.349329
min      32.000000
25%      33.500000
50%      35.500000
75%      38.250000
max      42.000000
Name: Age, dtype: float64


In [46]:
# Recall our DataFrame. Can we use describe() on a single column of data?

df.head()

      Name  Age      City  Survived
0    Alice   34  New York         1
1      Bob   37   Unknown         0
2  Charlie   32   Chicago         1
3    David   42   Houston         1


In [47]:
df['Age'].describe()

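# For a numeric column such as Age, describe() returns:
# count - number of non-null values in the column
# mean, std - average and standard deviation
# min, 25%, 50%, 75%, max - smallest value, quartiles, and largest value
# For a non-numeric column it returns count, unique, top, and freq instead:
# unique - number of distinct categories in the column
# top - most frequent category
# freq - count of the most frequent category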

count     4.000000
mean     36.250000
std       4.349329
min      32.000000
25%      33.500000
50%      35.500000
75%      38.250000
max      42.000000
Name: Age, dtype: float64
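
The `unique`, `top`, and `freq` entries appear when `describe()` is called on a non-numeric column, not on a numeric one like `Age`. A small sketch on a stand-in Series of city names (the values and the name `cities` are made up for illustration):

```python
import pandas as pd

cities = pd.Series(['New York', 'Chicago', 'Chicago', 'Houston'], name='City')

# On a non-numeric column, describe() reports count, unique, top, and freq
summary = cities.describe()
print(summary)
```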


In [48]:
# Let's learn a new function called unique

df['City'].unique()

array(['New York', 'Unknown', 'Chicago', 'Houston'], dtype=object)

# Descriptive Stats on DataFrames

In [49]:
df[["Age","Survived"]].mean()

Age         36.25
Survived     0.75
dtype: float64


In [50]:
# Specifying the axis
# Note: axis=0 is the default and calculates the column statistics

df[["Age","Survived"]].mean(axis=0)

Age         36.25
Survived     0.75
dtype: float64


### Visualizing Axis
```
+------------+---------+--------+
|            |  A      |  B     |
+------------+---------+--------+
|      0     | 10      | 15     |----axis=1----->
+------------+---------+--------+
             |         |
             | axis=0  |
             ↓         ↓
```



In [51]:
# if we use axis = 1, we are calculating the row statistics

df[["Age","Survived"]].mean(axis=1)

0    17.5
1    18.5
2    16.5
3    21.5
dtype: float64
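
To make the axis diagram concrete, here is a small sketch with two columns `A` and `B` and two rows. The second row of values and the name `scores` are made up for illustration.

```python
import pandas as pd

scores = pd.DataFrame({'A': [10, 20], 'B': [15, 25]})

# axis=0 collapses the rows: one result per column
column_means = scores.mean(axis=0)

# axis=1 collapses the columns: one result per row
row_means = scores.mean(axis=1)

print(column_means)
print(row_means)
```

A useful way to remember it: the axis you pass is the one that disappears from the result.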


In [52]:
# More examples

df.mean(numeric_only=True)

Age         36.25
Survived     0.75
dtype: float64


# Filtering using Boolean Operations

In [53]:
df.head()

      Name  Age      City  Survived
0    Alice   34  New York         1
1      Bob   37   Unknown         0
2  Charlie   32   Chicago         1
3    David   42   Houston         1


In [54]:
df['Survived'] == 1

0     True
1    False
2     True
3     True
Name: Survived, dtype: bool


In [58]:
df['Age'] > 40

0    False
1    False
2    False
3     True
Name: Age, dtype: bool


In [59]:
# Viewing only the rows where the condition is True

df[df['Age'] > 40]

    Name  Age     City  Survived
3  David   42  Houston         1


In [61]:
# Filtering on exact cell values

df[df['Name'] == 'Alice']

    Name  Age      City  Survived
0  Alice   34  New York         1


# Finding empty cells using isnull()

In [62]:
df.isnull()

    Name    Age   City  Survived
0  False  False  False     False
1  False  False  False     False
2  False  False  False     False
3  False  False  False     False


In [63]:
df['Survived'].isnull()

0    False
1    False
2    False
3    False
Name: Survived, dtype: bool


In [64]:
df[df['Survived'].isnull()]

Empty DataFrame
Columns: [Name, Age, City, Survived]
Index: []


In [65]:
df[df['Survived'].notnull()]

      Name  Age      City  Survived
0    Alice   34  New York         1
1      Bob   37   Unknown         0
2  Charlie   32   Chicago         1
3    David   42   Houston         1


# Combining Boolean Filtering

In [72]:
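# Build each condition as its own Boolean Series
over_30 = df['Age'] > 30
Survived_exists = df['Survived'].notnull()
Survived_exists

0    True
1    True
2    True
3    True
Name: Survived, dtype: bool


In [71]:
# Combine the conditions element-wise with & (and) rather than +
over_30_Survived_exists = over_30 & Survived_exists
over_30_Survived_exists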

0    True
1    True
2    True
3    True
dtype: bool


In [77]:
df[over_30_Survived_exists]

      Name  Age      City  Survived
0    Alice   34  New York         1
1      Bob   37   Unknown         0
2  Charlie   32   Chicago         1
3    David   42   Houston         1


In [81]:
over_40_or_under_34 = (df['Age'] > 40) | (df['Age'] < 34)

In [82]:
df[over_40_or_under_34]

      Name  Age     City  Survived
2  Charlie   32  Chicago         1
3    David   42  Houston         1


# NOTE:

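#### a & b (pandas) is the element-wise form of a and b (python)
#### a | b (pandas) is the element-wise form of a or b (python)
#### Wrap each comparison in parentheses, e.g. (df['Age'] > 40) | (df['Age'] < 34)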
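
The note above can be checked directly. The sketch below, on a stand-in Series named `ages` (a hypothetical name), shows `&` combining two conditions element-wise and Python's `and` failing on the same operands.

```python
import pandas as pd

ages = pd.Series([34, 37, 32, 42])

# Parentheses are required because & binds tighter than > and <
in_range = (ages > 33) & (ages < 40)
print(in_range.tolist())

# Python's `and` asks for a single truth value, which a Series cannot give
try:
    (ages > 33) and (ages < 40)
    raised = False
except ValueError:
    raised = True
print(raised)
```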
