
# Getting Started with pandas 📊

Welcome! This notebook will introduce you to `pandas`, a powerful Python library for data manipulation and analysis.


In [1]:

import pandas as pd
import numpy as np



## What is a Series?

A `Series` is a one-dimensional labeled array that can hold data of any type.


In [2]:

print("Creating a Series with numbers and a missing value:")
numbers = pd.Series([10, 20, 30, np.nan, 50])
print(numbers)


Creating a Series with numbers and a missing value:
0    10.0
1    20.0
2    30.0
3     NaN
4    50.0
dtype: float64
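
The `float64` dtype in the output comes from the `np.nan` entry: NumPy integers cannot represent a missing value, so pandas stores the whole Series as floats. The sketch below illustrates the two ideas in this section, labels and dtypes. The names `labeled`, `with_nan`, and `nullable` are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# A Series carries an index of labels alongside its values.
labeled = pd.Series([10, 20, 30], index=["a", "b", "c"])
print(labeled["b"])      # look up a value by its label
print(labeled.dtype)     # no missing values, so the integers stay integers

# np.nan is a float, so its presence upcasts the Series to float64.
with_nan = pd.Series([10, 20, np.nan])
print(with_nan.dtype)

# The nullable "Int64" extension dtype keeps integers and marks gaps as <NA>.
nullable = pd.Series([10, 20, None], dtype="Int64")
print(nullable)
```

If you need integer semantics together with missing values, the nullable dtype is one option; plain `float64` is what you get by default.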



## Creating a DataFrame

A `DataFrame` is a 2D table with labeled axes (rows and columns).


In [3]:

print("Creating a random 3x4 DataFrame with fruit names as columns:")
data = np.random.randint(1, 100, size=(3, 4))
df = pd.DataFrame(data, columns=['Apples', 'Bananas', 'Oranges', 'Pears'])
print(df)


Creating a random 3x4 DataFrame with fruit names as columns:
   Apples  Bananas  Oranges  Pears
0       4       66       60     16
1      94       59       65     64
2      75        7       78     73
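
Because `np.random.randint` draws fresh numbers on every run, the table above will differ when you execute the cell yourself. Here is a minimal sketch of two ways to get a repeatable table, assuming an arbitrary seed of `0`. The names `random_df` and `fixed_df` are illustrative.

```python
import numpy as np
import pandas as pd

fruit = ["Apples", "Bananas", "Oranges", "Pears"]

# A seeded generator makes the "random" table repeatable (the seed 0 is arbitrary).
rng = np.random.default_rng(0)
random_df = pd.DataFrame(rng.integers(1, 100, size=(3, 4)), columns=fruit)
print(random_df)

# A dict of column name -> list of values builds the same kind of table explicitly.
fixed_df = pd.DataFrame({
    "Apples": [4, 94, 75],
    "Bananas": [66, 59, 7],
    "Oranges": [60, 65, 78],
    "Pears": [16, 64, 73],
})
print(fixed_df.shape)
```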



## Working with Dates

`pd.date_range` generates a sequence of evenly spaced timestamps (daily by default) that can serve as a DataFrame's index.


In [4]:
print("Creating a DataFrame with a DateTime index:")
date_index = pd.date_range(start='2022-01-01', periods=5)
df_dates = pd.DataFrame(np.random.randn(5, 3), index=date_index, columns=['X', 'Y', 'Z'])
print(df_dates)

Creating a DataFrame with a DateTime index:
                   X         Y         Z
2022-01-01  1.611931 -0.348491  2.119816
2022-01-02 -0.010099 -1.378896  1.241710
2022-01-03 -0.360585 -0.074946  3.625112
2022-01-04 -1.799954  0.323422 -0.482175
2022-01-05 -0.272994  0.626908  0.435980
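
A date index is useful mainly because you can select rows by date. The sketch below shows slicing with date strings and a non-default frequency. The names `days`, `ts`, `window`, and `business` are illustrative.

```python
import numpy as np
import pandas as pd

# freq="D" (calendar day) is the default; it is spelled out here for clarity.
days = pd.date_range(start="2022-01-01", periods=5, freq="D")
ts = pd.DataFrame({"X": np.arange(5)}, index=days)

# Label slices on a DatetimeIndex accept date strings and include both endpoints.
window = ts.loc["2022-01-02":"2022-01-03"]
print(window)

# freq="B" keeps business days only; 2022-01-01 was a Saturday,
# so the range begins on the following Monday.
business = pd.date_range(start="2022-01-01", periods=3, freq="B")
print(business[0].date())
```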



## Data Inspection

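You can view portions of the DataFrame or inspect its structure easily. `head()` and `tail()` return five rows by default, so on this three-row DataFrame both print the whole table.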


In [5]:
print("First rows of the DataFrame:")
print(df.head())

print("Last rows of the DataFrame:")
print(df.tail())

print("Column names in the DataFrame:")
print(df.columns)

First rows of the DataFrame:
   Apples  Bananas  Oranges  Pears
0       4       66       60     16
1      94       59       65     64
2      75        7       78     73
Last rows of the DataFrame:
   Apples  Bananas  Oranges  Pears
0       4       66       60     16
1      94       59       65     64
2      75        7       78     73
Column names in the DataFrame:
Index(['Apples', 'Bananas', 'Oranges', 'Pears'], dtype='object')
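
Passing a row count makes the difference between the two calls visible on a table this small. The sketch below also shows a few other inspection attributes. The name `inventory` is illustrative, and the values are copied from the output above.

```python
import pandas as pd

inventory = pd.DataFrame({
    "Apples": [4, 94, 75],
    "Bananas": [66, 59, 7],
    "Oranges": [60, 65, 78],
    "Pears": [16, 64, 73],
})

print(inventory.head(2))      # first two rows only
print(inventory.tail(1))      # last row only
print(inventory.shape)        # (rows, columns)
print(inventory.dtypes)       # one dtype per column
print(inventory.describe())   # count, mean, std, min, quartiles, max per numeric column
```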



## Data Transformation

Here we demonstrate how to modify and filter data using pandas.


In [6]:
print("Adding a new column 'Total' which is the row-wise sum:")
# Sum only the fruit columns so re-running the cell does not add 'Total' into itself
fruit_columns = ['Apples', 'Bananas', 'Oranges', 'Pears']
df['Total'] = df[fruit_columns].sum(axis=1)
print(df)

print("Sorting the DataFrame by the 'Total' column:")
df_sorted = df.sort_values(by='Total')
print(df_sorted)

Adding a new column 'Total' which is the row-wise sum:
   Apples  Bananas  Oranges  Pears  Total
0       4       66       60     16    146
1      94       59       65     64    282
2      75        7       78     73    233
Sorting the DataFrame by the 'Total' column:
   Apples  Bananas  Oranges  Pears  Total
0       4       66       60     16    146
2      75        7       78     73    233
1      94       59       65     64    282
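
Filtering is not shown in the cell above, so here is a minimal sketch of boolean-mask filtering on the same fruit values. The name `fruit` and the threshold `200` are arbitrary choices for illustration.

```python
import pandas as pd

fruit = pd.DataFrame({
    "Apples": [4, 94, 75],
    "Bananas": [66, 59, 7],
    "Oranges": [60, 65, 78],
    "Pears": [16, 64, 73],
})
fruit["Total"] = fruit[["Apples", "Bananas", "Oranges", "Pears"]].sum(axis=1)

# A comparison on a column yields a boolean Series, one True/False per row.
mask = fruit["Total"] > 200
print(fruit[mask])            # keeps only the rows where mask is True

# Conditions combine with & (and) and | (or); each needs its own parentheses.
both = fruit[(fruit["Total"] > 200) & (fruit["Apples"] < 90)]
print(both)
```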



## Selecting Rows

Use `loc` for label-based access and `iloc` for integer-location access.


In [7]:
print("Selecting rows 0 to 1 and columns 1 to 2 using iloc:")
selected = df.iloc[0:2, 1:3]
print(selected)

Selecting rows 0 to 1 and columns 1 to 2 using iloc:
   Bananas  Oranges
0       66       60
1       59       65
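
The cell above uses only `iloc`. For comparison, here is a sketch of the same selection made with `loc`. The row labels `mon`, `tue`, and `wed` are hypothetical and were added so that labels and positions differ.

```python
import pandas as pd

stock = pd.DataFrame(
    {"Bananas": [66, 59, 7], "Oranges": [60, 65, 78], "Pears": [16, 64, 73]},
    index=["mon", "tue", "wed"],   # hypothetical row labels
)

# loc selects by label, and label slices include the end label.
by_label = stock.loc["mon":"tue", ["Bananas", "Oranges"]]

# iloc selects by position, and position slices exclude the end position.
by_position = stock.iloc[0:2, 0:2]

print(by_label)
print(by_label.equals(by_position))
```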



## Handling Missing Data

Missing values are common in real-world datasets. Here's how to detect and handle them.


In [8]:
print("Creating a copy of the DataFrame and inserting a NaN value:")
df_with_nan = df.copy()
# Cast to float first so the column can hold NaN without relying on implicit upcasting
df_with_nan['Apples'] = df_with_nan['Apples'].astype(float)
df_with_nan.iloc[0, 0] = np.nan
print(df_with_nan)

print("Checking where the missing values are:")
print(df_with_nan.isna())

print("Any missing?", df_with_nan.isna().any().any())

Creating a copy of the DataFrame and inserting a NaN value:
   Apples  Bananas  Oranges  Pears  Total
0     NaN       66       60     16    146
1    94.0       59       65     64    282
2    75.0        7       78     73    233
Checking where the missing values are:
   Apples  Bananas  Oranges  Pears  Total
0    True    False    False  False  False
1   False    False    False  False  False
2   False    False    False  False  False
Any missing? True
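
Detection is covered above. For handling, the two most common options are dropping incomplete rows and filling the gaps. The sketch below shows both on a small illustrative frame named `sales`. Filling with the column mean is just one possible strategy.

```python
import numpy as np
import pandas as pd

sales = pd.DataFrame({
    "Apples": [np.nan, 94.0, 75.0],
    "Bananas": [66, 59, 7],
})

print(sales.isna().sum())     # number of missing values per column

# Option 1: drop every row that contains at least one NaN.
dropped = sales.dropna()

# Option 2: fill the gaps, here with the mean of the non-missing values.
filled = sales.fillna({"Apples": sales["Apples"].mean()})

print(dropped)
print(filled)
```

Both calls return new DataFrames and leave `sales` unchanged.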



## Applying Functions

`.apply()` on a column calls a function on every value and returns the results as a new Series.


In [9]:
print("Applying a lambda function to double the 'Apples' column:")
df['Apples x2'] = df['Apples'].apply(lambda x: x * 2)
print(df)

Applying a lambda function to double the 'Apples' column:
   Apples  Bananas  Oranges  Pears  Total  Apples x2
0       4       66       60     16    146          8
1      94       59       65     64    282        188
2      75        7       78     73    233        150
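
For simple arithmetic like doubling, plain column arithmetic gives the same result as `.apply()` without calling a Python function per value. The sketch below compares the two and adds a row-wise `apply`. The names `via_apply`, `via_arithmetic`, and `larger` are illustrative.

```python
import pandas as pd

fruit = pd.DataFrame({"Apples": [4, 94, 75], "Bananas": [66, 59, 7]})

# Element-wise: the lambda is called once per value in the column.
via_apply = fruit["Apples"].apply(lambda x: x * 2)

# Vectorized: the whole column is multiplied in one operation.
via_arithmetic = fruit["Apples"] * 2
print(via_apply.equals(via_arithmetic))

# Row-wise: with axis=1 the function receives each row as a Series.
larger = fruit.apply(lambda row: max(row["Apples"], row["Bananas"]), axis=1)
print(larger.tolist())
```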



## Grouping Data

Use `groupby()` to perform operations over subsets of your data.


In [10]:
print("Grouping rows by 'Category' and calculating the mean of 'Apples' and 'Bananas':")
df['Category'] = ['Fruit', 'Fruit', 'Vegetable']
print(df.groupby('Category')[['Apples', 'Bananas']].mean())

Grouping rows by 'Category' and calculating the mean of 'Apples' and 'Bananas':
           Apples  Bananas
Category                  
Fruit        49.0     62.5
Vegetable    75.0      7.0
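
When you want different statistics for different columns, `agg` with named aggregation is one way to express it. This is a sketch on the same values. The output column names `apples_mean`, `bananas_sum`, and `rows` are made up for the example.

```python
import pandas as pd

fruit = pd.DataFrame({
    "Apples": [4, 94, 75],
    "Bananas": [66, 59, 7],
    "Category": ["Fruit", "Fruit", "Vegetable"],
})

# Named aggregation: output_column=(input_column, function).
summary = fruit.groupby("Category").agg(
    apples_mean=("Apples", "mean"),
    bananas_sum=("Bananas", "sum"),
    rows=("Apples", "count"),
)
print(summary)
```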
