# Topics - Pandas, Series, DataFrame, Accessing Data, Querying Rows and Columns, Operations Between Columns and IO Operations

# Pandas

The pandas library is useful for working with ***structured data***.<br>

What is structured data? <br>
Data stored in tables, such as CSV files, Excel spreadsheets or database tables, is structured.<br>

Unstructured data consists of free-form text, images, sound or video.<br>

If you work with structured data, pandas will be a great utility for you.


## Importing Pandas

Most pandas users import the library under an alias so they can refer to it as **pd**.

In [1]:
import pandas as pd

# Series

Series is a one-dimensional labeled array capable of holding any data type (integers, floats, strings, objects, etc.). It is essentially a column in a spreadsheet or a single-dimensional NumPy array with additional functionalities.<br>

Key Characteristics:
* One-dimensional: Data is arranged in a single column.
* Labeled: Each element has an associated label (index).
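* Size-immutable: the values can be changed, but operations that add or remove elements generally return a new Series.

NOTE: A Series can hold any data type, but it has a single dtype for the whole column rather than a separate type per value. If the values are of mixed types, pandas stores them under the generic `object` dtype.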
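
The note above can be checked with a small sketch. The variable names `ints` and `mixed` are made up for illustration; the point is only that a Series reports one dtype for all of its values.

```python
import pandas as pd

# All values are integers, so the Series gets a single integer dtype
# (int64 on most platforms).
ints = pd.Series([1, 2, 3])
print(ints.dtype)

# Mixing an int, a str and a float leaves no common numeric type,
# so pandas falls back to the generic object dtype for the whole Series.
mixed = pd.Series([1, 'two', 3.0])
print(mixed.dtype)  # object
```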

![class 5](series_anatomy.png)
Image Source - Pandas Cookbook

## Creating a Series 

In [2]:
# creating a series from listed data
data = ['a','e','i','o','u']
s = pd.Series(data)
print(f'Series from a list:\n{s}')

# From a NumPy array
import numpy as np
data = np.array([1, 2, 3, 4, 5])
s = pd.Series(data)
print(f'Series from numpy array:\n{s}')

# From a dictionary
data = {'a': 1, 'b': 2, 'c': 3}
s = pd.Series(data)
print(f'Series from a dictionary:\n{s}')

Series from a list:
0    a
1    e
2    i
3    o
4    u
dtype: object
Series from numpy array:
0    1
1    2
2    3
3    4
4    5
dtype: int64
Series from a dictionary:
a    1
b    2
c    3
dtype: int64


NOTE: More Series methods and operations will be covered later in the course.

# Data Frame

## Introduction to Data Frame

* A DataFrame is a two-dimensional labeled data structure with columns of potentially different types.<br>
* In simple terms, a DataFrame is a table of data with rows and columns, where each column is a Series.
* Visually, it appears as a table consisting of *Rows* and *Columns*.<br>
* Beneath the surface are three components: *`index`*, *`columns`*, and *`data`*.

![class 5](anatomy_dataframe.png)

*`Index labels`* and *`column names`* refer to the individual members of the index and the columns, respectively.<br>
`Index` refers to all the index labels as a whole and `Columns` refers to all the column names as a whole.

The index labels and column names let you pull out data by label. The index is also used for *alignment*: when multiple Series or DataFrames are combined, their indexes are aligned first, before any calculation occurs.
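
Index alignment is easiest to see with two small Series whose labels only partly overlap. This is a minimal sketch with made-up labels and values (`a`, `b`, `total` are hypothetical names): values are matched by label, not by position, and labels present on only one side produce a missing value.

```python
import pandas as pd

a = pd.Series([10, 20, 30], index=['x', 'y', 'z'])
b = pd.Series([1, 2, 3], index=['y', 'z', 'w'])

# Addition pairs values that share a label: 'y' -> 20 + 1, 'z' -> 30 + 2.
# 'x' exists only in a and 'w' only in b, so both become NaN in the result.
total = a + b
print(total)
```

Because NaN is a floating-point value, the result is stored as floats even though both inputs held integers.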

Collectively, the columns and the index are known as the axes.<br>
**Index - Axis 0**<br>
**Columns - Axis 1**

Pandas uses **NaN** (Not a Number) *to represent missing values (including, in most cases, a missing string value)*. A Python `None` stored in a text column is displayed as `None`, but it is also treated as missing.

The three consecutive dots, `...`, indicate that at least one column exists but could not be shown because of the display limit.

### Creating DataFrames

There are multiple ways to create a DataFrame using the `pd.DataFrame()` constructor.

NOTE: You can also create a DataFrame by reading a file, which will be taught later in class.

In [3]:
# Creating a DataFrame from a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
print(f'DataFrame using Dict:\n{df}')

# Creating a DataFrame from a NumPy array (a mixed-type array stores every value as a string, so 'Age' is no longer numeric)
data = np.array([
    ['Alice', 25, 'New York'],
    ['Bob', 30, 'Los Angeles'],
    ['Charlie', 35, 'Chicago']
])
df = pd.DataFrame(data, columns=['Name', 'Age', 'City'])
print(f'DataFrame using Numpy array:\n{df}')

# Creating DataFrame using list of lists
data = [
    ['Alice', 25, 'New York'],
    ['Bob', 30, 'Los Angeles'],
    ['Charlie', 35, 'Chicago']
]
df = pd.DataFrame(data, columns=['Name', 'Age', 'City'])
print(f'DataFrame using Lists of List:\n{df}')

DataFrame using Dict:
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
DataFrame using Numpy array:
      Name Age         City
0    Alice  25     New York
1      Bob  30  Los Angeles
2  Charlie  35      Chicago
DataFrame using Lists of List:
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago


### DataFrame Attributes

DataFrame attributes provide metadata and basic information about the DataFrame.<br>
Some of the DataFrame attributes are:
`df.shape`, 
`df.columns`, 
`df.index` ,
`df.dtypes` ,
`df.size` 

NOTE: 
1. You can use `print()` to display these attributes, or just evaluate the expression as the last line of a cell to see its value.
2. You do not have to name your DataFrame *df*; any valid variable name works. It is like algebra, where the unknown is usually called 'x' but could be named with any other letter.

#### `df.shape` - Returns a tuple representing the dimensionality of the DataFrame.<br>

In [4]:
data = {
    'Region': ['Europe', 'North America', 'Asia', 'Africa', 'South America'],
    'No. of Tourists': [30000000, 25000000, 45000000, 15000000, 22000000],
    'Average Temperature (F)': [55, 65, 75, 85, 75]
}
df = pd.DataFrame(data)
print('DataFrame:\n',df)

print(f'')
print(f'Shape of Data Frame: {df.shape}') # shape of the dataframe


DataFrame:
           Region  No. of Tourists  Average Temperature (F)
0         Europe         30000000                       55
1  North America         25000000                       65
2           Asia         45000000                       75
3         Africa         15000000                       85
4  South America         22000000                       75

Shape of Data Frame: (5, 3)


#### `df.columns` - Returns an Index object containing the column labels.<br>

In [5]:
#print(f'Columns of Data Frame: {df.columns}') # gives the column names of the DataFrame
df.columns

Index(['Region', 'No. of Tourists', 'Average Temperature (F)'], dtype='object')

#### `df.index` - Returns an Index object containing the row labels.<br>

In [6]:
#print(f'Index of Data Frame: {df.index}') #gives the index of the dataframe
df.index

RangeIndex(start=0, stop=5, step=1)

In [7]:
# to find the index of a series you can do it the following way

df['Region'].index

RangeIndex(start=0, stop=5, step=1)

#### `df.dtypes` - Returns the data types of each column.<br>

In [8]:
#print(f'Data-types of columns of Data Frame: {df.dtypes}') #returns the data type of each column
df.dtypes

Region                     object
No. of Tourists             int64
Average Temperature (F)     int64
dtype: object

In [9]:
# to find the datatype of the series
df['Region'].dtype #change the column name to 'No. of Tourists' and see the datatype

dtype('O')

#### `df.size` - Returns the number of elements in the DataFrame.<br>

In [10]:
#print(f'Size of Data Frame: {df.size}') #returns the number of elements (rows x columns)
df.size

15

#### `min()` - A method that returns the minimum value of a Series (or of each column when called on a DataFrame)

In [11]:
df['No. of Tourists'].min()

15000000

#### `max()` - A method that returns the maximum value of a Series (or of each column when called on a DataFrame)

In [12]:
df['Average Temperature (F)'].max()

85

#### `unique()` - A Series method (a DataFrame has no `.unique` method) that returns a NumPy array with the unique values.

In [13]:
df['Average Temperature (F)'].unique()

array([55, 65, 75, 85])

### DataFrame Methods
`head()`<br>
`tail()`<br>
`info()`<br>
`describe()`<br>
`drop()`<br>
`value_counts()`<br>
`rename()`


#### `head()`: Shows the first n rows of the dataframe.<br>

In [14]:
# Data about popular cities: city, state, population, temperature and income.
# The values are not real; they were created only to illustrate the concepts.
data = {
    'City Name': ['New York City', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix', 'Philadelphia', 'San Antonio', 'San Diego', 'Dallas', 'San Jose'],
    'State Name': ['New York', 'California', 'Illinois', 'Texas', 'Arizona', 'Pennsylvania', 'Texas', 'California', 'Texas', None],
    'Population (approx.)': [8400000, 4000000, 2700000, 2300000, 1600000, 1600000, 1500000, 1400000, 1400000, 1000000],
    'Average Temperature (F)': [52, 64, 51, 69, 78, 54, 72, 67, 66, 62],
    'Average Income (USD)': [73000, 68000, None, 65000, 58000,None, 55000, 70000, 63000, None]
}
city_df = pd.DataFrame(data)


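# use head() to get the first n rows of a DataFrame
# note: this call prints `df` (the tourism DataFrame from earlier, which has only 5 rows); try city_df.head(5) as well
print(f'First 5 rows of the data\n{df.head(5)}') #replace 5 with any other number to see the results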

First 5 rows of the data
          Region  No. of Tourists  Average Temperature (F)
0         Europe         30000000                       55
1  North America         25000000                       65
2           Asia         45000000                       75
3         Africa         15000000                       85
4  South America         22000000                       75


#### `tail()`: Shows the last n rows of the dataframe.<br>

In [15]:
# use tail to get the last n rows of the dataframe
print(f'Last 6 rows of the data\n{city_df.tail(6)}') # replace 6 with any no. to see the results

Last 6 rows of the data
      City Name    State Name  Population (approx.)  Average Temperature (F)  \
4       Phoenix       Arizona               1600000                       78   
5  Philadelphia  Pennsylvania               1600000                       54   
6   San Antonio         Texas               1500000                       72   
7     San Diego    California               1400000                       67   
8        Dallas         Texas               1400000                       66   
9      San Jose          None               1000000                       62   

   Average Income (USD)  
4               58000.0  
5                   NaN  
6               55000.0  
7               70000.0  
8               63000.0  
9                   NaN  


#### `info()`: Provides a summary of the DataFrame, including the index dtype, column dtypes, non-null values, and memory usage.<br>

In [16]:
city_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 5 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   City Name                10 non-null     object 
 1   State Name               9 non-null      object 
 2   Population (approx.)     10 non-null     int64  
 3   Average Temperature (F)  10 non-null     int64  
 4   Average Income (USD)     7 non-null      float64
dtypes: float64(1), int64(2), object(2)
memory usage: 532.0+ bytes


#### `describe()`: Generates descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset’s distribution, excluding NaN values. By default only numeric columns are summarized.<br>

In [17]:
city_df.describe()

| | Population (approx.) | Average Temperature (F) | Average Income (USD) |
|---|---|---|---|
| count | 10.0 | 10.0 | 7.0 |
| mean | 2590000.0 | 63.5 | 64571.428571 |
| std | 2219835.0 | 8.897565 | 6451.282634 |
| min | 1000000.0 | 51.0 | 55000.0 |
| 25% | 1425000.0 | 56.0 | 60500.0 |
| 50% | 1600000.0 | 65.0 | 65000.0 |
| 75% | 2600000.0 | 68.5 | 69000.0 |
| max | 8400000.0 | 78.0 | 73000.0 |


#### `drop()`: Drops specified labels from rows or columns.<br>


In [18]:
# Let's drop the column 'Average Income (USD)'
# drop() returns a new DataFrame; with inplace=False (the default) city_df itself is unchanged
city_df.drop(columns=['Average Income (USD)'], inplace=False)
city_df

# If the column does not exist, drop() raises a KeyError.
# Change to inplace=True and display city_df in the next cell to see the difference.
# To drop a row instead, pass its index label: city_df.drop(5, axis=0)

| | City Name | State Name | Population (approx.) | Average Temperature (F) | Average Income (USD) |
|---|---|---|---|---|---|
| 0 | New York City | New York | 8400000 | 52 | 73000.0 |
| 1 | Los Angeles | California | 4000000 | 64 | 68000.0 |
| 2 | Chicago | Illinois | 2700000 | 51 | NaN |
| 3 | Houston | Texas | 2300000 | 69 | 65000.0 |
| 4 | Phoenix | Arizona | 1600000 | 78 | 58000.0 |
| 5 | Philadelphia | Pennsylvania | 1600000 | 54 | NaN |
| 6 | San Antonio | Texas | 1500000 | 72 | 55000.0 |
| 7 | San Diego | California | 1400000 | 67 | 70000.0 |
| 8 | Dallas | Texas | 1400000 | 66 | 63000.0 |
| 9 | San Jose | None | 1000000 | 62 | NaN |


When `inplace=False` (the default), the operation returns a modified copy and leaves the original untouched, so you need to assign the result to a variable to keep it. When `inplace=True`, the DataFrame is modified in place and the method returns `None`.
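
The difference is easy to verify on a tiny frame. This is a sketch with a hypothetical two-column DataFrame named `demo`, not the city data used in this section.

```python
import pandas as pd

demo = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})  # hypothetical toy data

# Default (inplace=False): a new DataFrame is returned, the original keeps both columns.
without_b = demo.drop(columns=['b'])
print(list(demo.columns))       # ['a', 'b']
print(list(without_b.columns))  # ['a']

# inplace=True: the original is modified and the call itself returns None.
result = demo.drop(columns=['b'], inplace=True)
print(result)                   # None
print(list(demo.columns))       # ['a']
```

Assigning the returned copy (`demo = demo.drop(...)`) is often preferred because it makes the change visible in the code; either style works for the examples here.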

#### `value_counts()`: Returns a Series containing counts of unique values.

In [19]:
#The .value_counts method returns the count of all the data types in the DataFrame 
# when called on the .dtypes attribute.
dtypes_count=city_df.dtypes.value_counts() 
print(dtypes_count)

print(f'')
u_state_count=city_df['State Name'].value_counts() #gives count of unique states from the dataframe
print(f'{u_state_count}')

object     2
int64      2
float64    1
Name: count, dtype: int64

State Name
Texas           3
California      2
New York        1
Illinois        1
Arizona         1
Pennsylvania    1
Name: count, dtype: int64


#### `rename()` - This method allows you to rename columns or index labels with a dictionary mapping of old names to new names.

In [20]:
# let's rename all the columns
city_df.rename(columns={
    'City Name':'city_name',
    'State Name':'state_name',
    'Population (approx.)':'population_approx',
    'Average Temperature (F)':'avg.temp_f',
    'Average Income (USD)':'avg_income_usd'
    }, inplace = True)

city_df

| | city_name | state_name | population_approx | avg.temp_f | avg_income_usd |
|---|---|---|---|---|---|
| 0 | New York City | New York | 8400000 | 52 | 73000.0 |
| 1 | Los Angeles | California | 4000000 | 64 | 68000.0 |
| 2 | Chicago | Illinois | 2700000 | 51 | NaN |
| 3 | Houston | Texas | 2300000 | 69 | 65000.0 |
| 4 | Phoenix | Arizona | 1600000 | 78 | 58000.0 |
| 5 | Philadelphia | Pennsylvania | 1600000 | 54 | NaN |
| 6 | San Antonio | Texas | 1500000 | 72 | 55000.0 |
| 7 | San Diego | California | 1400000 | 67 | 70000.0 |
| 8 | Dallas | Texas | 1400000 | 66 | 63000.0 |
| 9 | San Jose | None | 1000000 | 62 | NaN |


#### `sort_values()`

To sort a DataFrame by the values of one or more columns, use the `sort_values()` method.

Parameters:

* by: Column or list of columns to sort by.
* axis: Specifies the axis to sort along. 0 for rows (default), 1 for columns.
* ascending: Sort order for each column. Can be a boolean or a list of booleans.
* inplace: If True, modify the DataFrame in place, otherwise return a copy.
* na_position: Position of NaN values. Can be 'first' or 'last'.

In [25]:
# sorting by one column (not displayed: only the last expression in a cell is shown)
city_df.sort_values('population_approx', ascending=True)


# sorting by multiple columns; ascending=False applies to both
city_df.sort_values(['avg.temp_f', 'population_approx'], ascending=False)

| | city_name | state_name | population_approx | avg.temp_f | avg_income_usd |
|---|---|---|---|---|---|
| 4 | Phoenix | Arizona | 1600000 | 78 | 58000.0 |
| 6 | San Antonio | Texas | 1500000 | 72 | 55000.0 |
| 3 | Houston | Texas | 2300000 | 69 | 65000.0 |
| 7 | San Diego | California | 1400000 | 67 | 70000.0 |
| 8 | Dallas | Texas | 1400000 | 66 | 63000.0 |
| 1 | Los Angeles | California | 4000000 | 64 | 68000.0 |
| 9 | San Jose | None | 1000000 | 62 | NaN |
| 5 | Philadelphia | Pennsylvania | 1600000 | 54 | NaN |
| 0 | New York City | New York | 8400000 | 52 | 73000.0 |
| 2 | Chicago | Illinois | 2700000 | 51 | NaN |
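
The `ascending` and `na_position` parameters can be combined. The sketch below uses a hypothetical four-row frame (`toy`, with made-up values) to sort one column ascending and another descending, with missing values placed first.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'state': ['TX', 'CA', 'TX', 'CA'],
    'income': [55000, np.nan, 65000, 70000],
})

# state ascending, then income descending within each state;
# na_position='first' puts the missing income ahead of the other CA row.
ordered = toy.sort_values(
    by=['state', 'income'],
    ascending=[True, False],
    na_position='first',
)
print(ordered)
print(list(ordered.index))  # [1, 3, 2, 0]
```

The original index labels travel with their rows, which is why the printed index is no longer in 0, 1, 2, 3 order.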


#### `sort_index()`

To sort a DataFrame by its index, use the `sort_index()` method.

Parameters:

* axis: Specifies the axis to sort along. 0 for rows (default), 1 for columns.
* level: Sort by a specific level in a MultiIndex.
* ascending: Sort in ascending (True) or descending (False) order.
* inplace: If True, modify the DataFrame in place, otherwise return a copy.

In [28]:
# sort by the row index in descending order (not displayed: only the last expression in a cell is shown)
city_df.sort_index(ascending=False)

# sort the columns by name
city_df.sort_index(axis=1)

| | avg.temp_f | avg_income_usd | city_name | population_approx | state_name |
|---|---|---|---|---|---|
| 0 | 52 | 73000.0 | New York City | 8400000 | New York |
| 1 | 64 | 68000.0 | Los Angeles | 4000000 | California |
| 2 | 51 | NaN | Chicago | 2700000 | Illinois |
| 3 | 69 | 65000.0 | Houston | 2300000 | Texas |
| 4 | 78 | 58000.0 | Phoenix | 1600000 | Arizona |
| 5 | 54 | NaN | Philadelphia | 1600000 | Pennsylvania |
| 6 | 72 | 55000.0 | San Antonio | 1500000 | Texas |
| 7 | 67 | 70000.0 | San Diego | 1400000 | California |
| 8 | 66 | 63000.0 | Dallas | 1400000 | Texas |
| 9 | 62 | NaN | San Jose | 1000000 | None |


# Accessing Data, Querying Rows and Columns

### Selecting Columns

In [None]:
city_df['state_name'] #selecting single column

In [None]:
city_df[['city_name','avg.temp_f','state_name']] #selecting multiple columns

### `loc` Method

`loc` is label-based: you specify rows and columns by their labels. Label slices include the end label.

In [None]:
# Let's use the city example from above
city_df.loc[:,"state_name"] # all the rows of the column 'state_name'

In [None]:
city_df.loc[0] # the row with index label 0 (the first row here)

In [None]:
city_df.loc[0:5] # rows labeled 0 through 5: six rows, because label slices include the end

In [None]:
city_df.loc[0:2,['city_name','population_approx']] # multiple rows and columns

### `iloc` Method

`iloc` is integer-position-based: you specify rows and columns by their integer position, and slices exclude the end position.

In [None]:
city_df.iloc[4] #selecting a single row by integer position

In [None]:
city_df.iloc[2:5] #selecting multiple rows

In [None]:
city_df.iloc[:,3] #Selecting all rows but a single column

In [None]:
city_df.iloc[3:6,0:3] #selecting multiple rows and columns
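
With the default 0, 1, 2, ... index, labels and positions coincide, which hides the difference between `loc` and `iloc`. The sketch below uses a hypothetical frame (`toy`) whose index labels are 10, 20, 30, 40 to make the difference visible.

```python
import pandas as pd

toy = pd.DataFrame(
    {'city': ['A', 'B', 'C', 'D'], 'temp': [52, 64, 51, 69]},
    index=[10, 20, 30, 40],  # labels deliberately differ from positions
)

by_label = toy.loc[10:30]     # labels 10, 20 and 30: the end label is included
by_position = toy.iloc[0:2]   # positions 0 and 1: the end position is excluded
print(len(by_label), len(by_position))  # 3 2

# The same cell reached both ways: label (20, 'temp') is position (1, 1).
print(toy.loc[20, 'temp'])    # 64
print(toy.iloc[1, 1])         # 64
```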

### `Boolean Indexing`

Boolean indexing allows you to filter data based on conditions.

In [None]:
city_df[city_df['state_name']=='Texas']

In [None]:
city_df[city_df['avg.temp_f']< 55]

## Filtering Data

In [None]:
# Filtering data using multiple conditions combined with '&' (and) and '|' (or); wrap each condition in parentheses
city_df[(city_df['avg.temp_f'] < 65) & (city_df['state_name'] == 'California')]

#city_df[(city_df['avg.temp_f'] < 60) | (city_df['state_name']=='California')]
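
A condition such as `toy['temp'] > 70` is itself a Series of True/False values, so conditions can be stored in variables and combined. This is a minimal sketch on a hypothetical four-row frame; it also shows `~` (not) and `isin()`, which are not used elsewhere in this section.

```python
import pandas as pd

toy = pd.DataFrame({
    'state': ['Texas', 'California', 'Texas', 'Arizona'],
    'temp': [69, 64, 72, 78],
})

hot = toy['temp'] > 70          # boolean Series: False, False, True, True
texas = toy['state'] == 'Texas' # boolean Series: True, False, True, False

both = toy[hot & texas]         # rows where both conditions hold
either = toy[hot | texas]       # rows where at least one holds
not_texas = toy[~texas]         # rows where the condition is False
listed = toy[toy['state'].isin(['Texas', 'Arizona'])]  # membership test

print(list(both.index))       # [2]
print(list(either.index))     # [0, 2, 3]
print(list(not_texas.index))  # [1, 3]
print(list(listed.index))     # [0, 2, 3]
```

Python's `and`/`or` keywords do not work element-wise on Series, which is why `&` and `|` are used, and the parentheses are needed when conditions are written inline because `&` binds more tightly than `<` or `==`.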

## Query Method
The query method provides a way to select data using a query string.

In [None]:
city_df.query('city_name=="New York City"')

In [None]:
city_df.query('state_name == "Texas" and population_approx >= 1500000')
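
A query string can also refer to a Python variable by prefixing its name with `@`. The sketch below uses a hypothetical three-row frame and a made-up threshold `min_pop`, and compares the result with the equivalent boolean-indexing expression.

```python
import pandas as pd

toy = pd.DataFrame({
    'state_name': ['Texas', 'Texas', 'California'],
    'population_approx': [2300000, 1400000, 4000000],
})

min_pop = 1500000  # hypothetical threshold, referenced in the query as @min_pop

big_texas = toy.query('state_name == "Texas" and population_approx >= @min_pop')

# The same filter written with boolean indexing
same = toy[(toy['state_name'] == 'Texas') & (toy['population_approx'] >= min_pop)]

print(big_texas)
print(big_texas.equals(same))  # True
```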

## Finding and Handling Missing Data 

#### `isna()` - Determines whether each individual value is missing or not.

NOTE: `isnull()` is an alias for `isna()`, so you can use either.

In [None]:
city_df.isna()

# you can also check a specific Series
#city_df['avg_income_usd'].isna()

#### `fillna()` - Used to replace missing values in a dataframe or series

In [None]:
city_df['avg_income_usd'] = city_df['avg_income_usd'].fillna(10000) # fill missing avg_income_usd values with 10,000 (a number, not the string '10000')
city_df

#### `dropna()` - Deletes rows that contain missing values (by default, any row with at least one missing value).

In [None]:
city_df.dropna()
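
The three methods are often used together: count the gaps first, then decide per column whether to fill or drop. A minimal sketch on a hypothetical three-row frame follows; filling with the column mean is just one possible choice, not a recommendation from this course.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'city': ['A', 'B', 'C'],
    'income': [73000.0, np.nan, 65000.0],
    'state': ['NY', None, 'TX'],
})

# isna() gives True/False per cell; summing counts the missing values per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# fillna() with a dict fills only the listed columns (here: income with its mean).
filled = toy.fillna({'income': toy['income'].mean()})
print(filled)

# dropna(subset=...) removes rows only when the listed columns are missing.
kept = toy.dropna(subset=['state'])
print(kept)
```

Both `fillna()` and `dropna()` return new DataFrames here, so `toy` still contains its missing values afterwards.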

# Operations Between Columns

## Creating New Columns

In [None]:
# Creating new columns using direct assignment
city_areas = [302.6, 503, 227.3, 637.4, 517.6, 134.2, 465.4, 372.4, 340.5, 180.5]

city_df['city_area_sq_miles']=city_areas
city_df

In [None]:
# you can compute a new column from other columns; here people per square mile, rounded to 2 decimals
city_df['population_density'] = (city_df['population_approx'] / city_df['city_area_sq_miles']).round(2)
city_df

In [None]:
def temp_category(temp):
    if temp < 60:
        return 'Cold'
    elif 60 <= temp <= 75:
        return 'Moderate'
    else:
        return 'Hot'

# Apply the function to each value of the 'avg.temp_f' column
city_df['temp_category'] = city_df['avg.temp_f'].apply(temp_category)
print(city_df)
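
`apply()` calls the Python function once per value. As an alternative, the same three categories can be computed for the whole column at once with NumPy's `np.select`. This is a sketch under the same thresholds as `temp_category`; the Series `temps` holds made-up values chosen to hit the boundaries.

```python
import numpy as np
import pandas as pd

def temp_category(temp):
    # Same rules as the function above, repeated so this block runs on its own
    if temp < 60:
        return 'Cold'
    elif 60 <= temp <= 75:
        return 'Moderate'
    else:
        return 'Hot'

temps = pd.Series([52, 60, 75, 78])  # hypothetical temperatures

# Conditions are checked in order; the first one that is True decides the label.
conditions = [temps < 60, temps <= 75]
labels = np.select(conditions, ['Cold', 'Moderate'], default='Hot')

print(labels.tolist())                      # ['Cold', 'Moderate', 'Moderate', 'Hot']
print(temps.apply(temp_category).tolist())  # same result via apply()
```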

## Arithmetic Operations 

In [None]:
# Add 1 degree to every value in the column
city_df['avg.temp_f'] = city_df['avg.temp_f'] + 1
city_df

In [None]:
# Decrease the population by 100,000 for estimation correction
city_df['population_approx'] = city_df['population_approx'] - 100000
city_df

In [None]:
# Estimate population after 10% growth
city_df['population_approx_growth'] = city_df['population_approx'] * 1.10
city_df

In [None]:
# Convert population to millions
city_df['population_millions'] = city_df['population_approx'] / 1000000

# IO Operations

## Reading a File

In [None]:
# Reading a csv file 
retails_sales_df = pd.read_csv('retail_sales_dataset.csv') 
retails_sales_df

In [None]:
# Reading an Excel file: use pd.read_excel() in the same way, for example pd.read_excel('<filename>.xlsx')

## Saving a file

In [None]:
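# to save a DataFrame to a csv file, call to_csv() on the DataFrame itself (there is no pd.to_csv)
# index=False leaves the row index out of the file
#retails_sales_df.to_csv('<filename>.csv', index=False)

# to save to excel
#retails_sales_df.to_excel('<filename>.xlsx', index=False, sheet_name='<sheet name>')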
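
To try writing and reading without creating a file on disk, `to_csv()` and `read_csv()` also accept a file-like object. The sketch below round-trips a hypothetical two-row frame through an in-memory text buffer; with a real file you would pass a file name instead.

```python
import io
import pandas as pd

toy = pd.DataFrame({'city': ['A', 'B'], 'population': [100, 200]})  # made-up data

# Write the frame as CSV text into an in-memory buffer instead of a file.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
print(buffer.getvalue())

# Rewind the buffer and read the CSV text back into a new DataFrame.
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored)
```

Leaving out `index=False` would write the row index as an extra unnamed first column, which then shows up as `Unnamed: 0` when the file is read back. Reading and writing Excel files works along the same lines with `pd.read_excel()` and `DataFrame.to_excel()`, but typically needs an extra package such as `openpyxl` to be installed.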

# Data Aggregation and Group Operations


## Group By Operations

The `groupby()` method in pandas allows you to split data into separate groups based on one or more keys, apply an operation to each group independently, and then combine the results back together.
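
The split, apply and combine steps can be seen on a tiny example. The column names below follow the retail sales dataset used in this section, but the four rows are made up for illustration.

```python
import pandas as pd

sales = pd.DataFrame({
    'Gender': ['F', 'M', 'F', 'M'],
    'Total Amount': [100, 50, 30, 20],
})

# Split: one group per distinct value of 'Gender'
grouped = sales.groupby('Gender')
for key, group in grouped:
    print(key, len(group))   # each group is itself a small DataFrame

# Apply + combine: sum 'Total Amount' inside each group, results come back as one Series
totals = grouped['Total Amount'].sum()
print(totals)   # F -> 130, M -> 70
```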

In [None]:
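# Let's take the retail sales dataset and group it by 'Age'
age = retails_sales_df.groupby('Age')  # creates a GroupBy object; nothing is computed yet

age.count()  # non-null values per column in each age group (not displayed: only the last expression in a cell is shown)


# Now group by both Gender and Age
gender_age = retails_sales_df.groupby(['Gender', 'Age'])
gender_age.count()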

## Aggregation 
Pandas provides a wide array of built-in aggregation functions that can be applied to grouped data. These functions help to compute summary statistics on your data.

![class 5](aggregate_functions.png)

In [None]:
# grouping product category and gender with Total amount spent
prod_gen=retails_sales_df.groupby(['Product Category','Gender']).agg({
    'Total Amount': 'sum'
})

print(prod_gen)

# counting customers and summing the total amount by Gender and Age
gen_age=retails_sales_df.groupby(['Gender','Age']).agg(
    {
        'Customer ID':'count',
        'Total Amount': 'sum'
    }
)
print(gen_age)

# age statistics (mean, min, max, median) by gender
gen=retails_sales_df.groupby('Gender').agg({
    'Age': ['mean','min','max','median'] 
})
print(gen)
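
Passing a dictionary to `agg()` keeps the original column names (and creates two header levels when a column gets several functions). An alternative is named aggregation, where each output column is given its own name. The sketch below reuses the dataset's column names on four made-up rows.

```python
import pandas as pd

sales = pd.DataFrame({
    'Customer ID': ['C1', 'C2', 'C3', 'C4'],
    'Gender': ['F', 'M', 'F', 'M'],
    'Age': [30, 40, 50, 20],
    'Total Amount': [100, 50, 30, 20],
})

# Named aggregation: new_column_name=(source_column, function)
summary = sales.groupby('Gender').agg(
    customers=('Customer ID', 'count'),
    total_spent=('Total Amount', 'sum'),
    mean_age=('Age', 'mean'),
)
print(summary)
```

The result has one row per gender and flat, descriptive column names (`customers`, `total_spent`, `mean_age`), which can be easier to work with in later steps.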

## Pivot Tables

A pivot table is a powerful data analysis tool that allows you to summarize, explore, and manipulate data. It aggregates data according to specific categories, providing a multi-dimensional view of your dataset. While traditionally associated with spreadsheet software, Python offers efficient ways to create and manipulate pivot tables using the pandas library.

![class 5](pivot_tables.png)

In [None]:
# converting the data from 'Date' Column into Month and year
retails_sales_df['Date'] = pd.to_datetime(retails_sales_df['Date'])
retails_sales_df['month_year'] = retails_sales_df['Date'].dt.strftime('%Y - %m')

In [None]:
# creating a pivot table with Product category as index and month_year as column and sum of total amount as values

pv_tbl=retails_sales_df.pivot_table(index='Product Category', columns= 'month_year', values='Total Amount', aggfunc='sum')

# sorting the columns (the month_year labels) in ascending order
pv_tbl.sort_index(axis=1)
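
A pivot table is closely related to `groupby()`: it groups by the index and column keys, aggregates, and spreads one key across the columns. The sketch below shows this on four made-up rows (the category names and amounts are hypothetical) and uses `fill_value=0` so that combinations with no sales show 0 instead of NaN.

```python
import pandas as pd

sales = pd.DataFrame({
    'Product Category': ['Beauty', 'Beauty', 'Clothing', 'Clothing'],
    'month_year': ['2023 - 01', '2023 - 02', '2023 - 01', '2023 - 01'],
    'Total Amount': [100, 50, 30, 20],
})

table = sales.pivot_table(
    index='Product Category',
    columns='month_year',
    values='Total Amount',
    aggfunc='sum',
    fill_value=0,   # Clothing has no rows for '2023 - 02'
)
print(table)

# The same numbers via groupby: aggregate, then move month_year into the columns.
same = (
    sales.groupby(['Product Category', 'month_year'])['Total Amount']
    .sum()
    .unstack(fill_value=0)
)
print(same)
```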

## Crosstabulation

Crosstabulation, or cross-tab, is a method to quantitatively analyze the relationship between multiple variables. It is similar to a pivot table but is more focused on the count of occurrences of combinations of categories.

The crosstab() function in Pandas is used to create a cross-tabulation of two or more factors. It’s primarily used for categorical data.

In [None]:
# Simple cross-tab
pro_gen_tab=pd.crosstab(retails_sales_df['Product Category'],retails_sales_df['Gender'])
print(pro_gen_tab)

In [None]:
# cross-tab with an aggregate function: sum of 'Total Amount' for each category/gender combination
total_tab=pd.crosstab(retails_sales_df['Product Category'],retails_sales_df['Gender'],values=retails_sales_df['Total Amount'],aggfunc='sum')
total_tab

In [None]:
# You can normalize the data to get the percentage or proportion of each category combination.
nor_tab=pd.crosstab(retails_sales_df['Product Category'],retails_sales_df['Gender'],normalize=True)
nor_tab

In [None]:
# multi-dimensional cross tabs
multi_dim_tab=pd.crosstab([retails_sales_df['Product Category'],retails_sales_df['Gender']],retails_sales_df['Age'])
multi_dim_tab
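
`normalize=True` divides every count by the grand total. To get proportions within each row instead, `crosstab()` also accepts `normalize='index'`. The sketch below shows both the raw counts and the row-wise shares on four made-up rows.

```python
import pandas as pd

sales = pd.DataFrame({
    'Product Category': ['Beauty', 'Beauty', 'Beauty', 'Clothing'],
    'Gender': ['F', 'F', 'M', 'M'],
})

# Raw counts of each category/gender combination
counts = pd.crosstab(sales['Product Category'], sales['Gender'])
print(counts)

# Share of each gender within a product category: every row sums to 1
row_share = pd.crosstab(sales['Product Category'], sales['Gender'], normalize='index')
print(row_share)
```

Here two of the three Beauty rows are `F`, so the Beauty row reads roughly 0.67 and 0.33, while the single Clothing row gives 0.0 and 1.0.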