# Introduction To Pandas
-------------------------

Pandas is a Python library used for data manipulation and analysis. 
It provides powerful data structures, primarily DataFrame and Series, for handling structured data.

In [1]:
# Install pandas

!pip install pandas



In pandas documentation, it is standard practice to import the library using the alias pd, and this convention is assumed throughout all examples.

In [54]:
#Import pandas library
import pandas as pd


In [55]:
# Check the installed pandas version
print(pd.__version__)

2.1.4


Suppose we want to store passenger data in a table with three columns: Name (strings), Age (integers), and Gender ("male" or "female").

In [56]:
# To manually store data in a table, create a DataFrame
df = pd.DataFrame({
    'Name':  ["Braund, Mr. Owen Harris",
            "Allen, Mr. William Henry",
            "Bonnell, Miss. Elizabeth"],
    'Age': [22, 35, 58],
    'Gender':["male", "male", "female"]
})

df

,Name,Age,Gender
0,"Braund, Mr. Owen Harris",22,male
1,"Allen, Mr. William Henry",35,male
2,"Bonnell, Miss. Elizabeth",58,female


In [57]:
df['Age']   # Display specific column

0    22
1    35
2    58
Name: Age, dtype: int64

## Basic Data Structures in Pandas

#### 1. Series:
A Series is a one-dimensional labeled array that can hold any data type 
(integers, strings, floats, etc.).

**Size Mutable:**
- The values of a Series are mutable: you can modify existing elements without changing its size.
- Its length is usually changed with methods that return a new Series, such as pd.concat() or drop(), and the result is then reassigned.
- Assigning to a label that does not exist yet (for example ser.loc[5] = 66) also enlarges the Series, so it is not strictly fixed in size.

In [58]:
# Creating Series
series = pd.Series([11,22,33,44,55])

# Adding an element: pd.concat returns a new Series, which is assigned to ser
ser = pd.concat([series, pd.Series([66])], ignore_index=True)
print(ser)

0    11
1    22
2    33
3    44
4    55
5    66
dtype: int64
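
As a small illustration of the points above, the sketch below (with made-up values) changes one element in place and then uses drop(), which returns a new Series and leaves the original untouched:

```python
import pandas as pd

s = pd.Series([11, 22, 33, 44, 55])

# Modifying an existing value keeps the length unchanged
s.iloc[0] = 100
print(len(s))                 # 5

# drop() returns a new, shorter Series; s itself still has 5 elements
shorter = s.drop(4)           # removes the element labeled 4
print(len(shorter), len(s))   # 4 5
```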


**Homogeneous:**

A Series is homogeneous: it has a single dtype that applies to all of its elements.

Different Series can hold different types of data, but within a single Series pandas converts the values to one common type where possible (for example, integers mixed with floats become float64). When no common type exists, as in the example below, the Series falls back to the generic object dtype.

In [59]:
ser = pd.Series([1, 'Alice', 3.5])  # Mixed data types

print(ser)
print(ser.dtype)  # Checking the data type

0        1
1    Alice
2      3.5
dtype: object
object
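
To make the dtype behaviour concrete, here is a short sketch with three hypothetical Series. It compares an all-integer Series, one that mixes integers with a float, and one that mixes numbers with a string:

```python
import pandas as pd

ints = pd.Series([1, 2, 3])
mixed_numbers = pd.Series([1, 2.5, 3])       # integers are upcast to float
mixed_types = pd.Series([1, 'Alice', 3.5])   # no common numeric type

print(ints.dtype)            # int64
print(mixed_numbers.dtype)   # float64
print(mixed_types.dtype)     # object
```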


#### 2. DataFrame: 
A DataFrame is a two-dimensional, **size-mutable**, and potentially 
**heterogeneous** tabular data structure with labeled axes (rows and columns).

_Note_: **heterogeneous** means that it can hold different data types in different columns.

In [60]:
# Creating a DataFrame from a dictionary:
data = {"Name":["Jhon","Peter","Lisa"],      # string
        "Age":[24,27,30],                    # integer
        "Salary":[25000.0,30000.0,35000.0]}  # float
df = pd.DataFrame(data)
print(df)

    Name  Age   Salary
0   Jhon   24  25000.0
1  Peter   27  30000.0
2   Lisa   30  35000.0


**Size-mutable:** means that the number of rows and columns in a pandas DataFrame can be 
changed after its creation.

We can:
1. Add new rows and columns.
2. Remove existing rows and columns.

In [61]:
df = pd.DataFrame({'A':[1,2,3,4], 'B':[11,12,13,14]})

#Adding new column
df['C'] = [21,22,23,24]

# Removing column A (axis=1 refers to columns, axis=0 refers to rows)
df.drop('A', axis=1, inplace=True)

print(df)

    B   C
0  11  21
1  12  22
2  13  23
3  14  24
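
The cell above only changes columns. Rows can be added and removed too; the following is one possible sketch, assuming the row to add is built as a one-row DataFrame (new_row is a name chosen for illustration):

```python
import pandas as pd

df = pd.DataFrame({'B': [11, 12, 13, 14], 'C': [21, 22, 23, 24]})

# Add a row: concatenate a one-row DataFrame and renumber the index
new_row = pd.DataFrame({'B': [15], 'C': [25]})
df = pd.concat([df, new_row], ignore_index=True)

# Remove the row labeled 0 (drop works on rows by default)
df = df.drop(0)

print(df.shape)   # (4, 2)
```

Both pd.concat() and drop() return a new DataFrame here, which is why the result is assigned back to df.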


### REMEMBER

- Import the package with: import pandas as pd

- A table of data is stored as a pandas DataFrame

- Each column in a DataFrame is a Series

- You can do things by applying a method to a DataFrame or Series

_______________________________________________________________________________________________________________________________________________________

## Loading Data:

**How do I read and write tabular data?**

One of the primary uses of pandas is loading data into a DataFrame from various formats like CSV, Excel, JSON, etc.

- Pandas versatility: Supports multiple file formats or data sources natively.

- Common formats: Include CSV, Excel, SQL, JSON, Parquet, etc.

- Prefix convention: Each file format or source has a corresponding function with the prefix read_*.

- Examples: read_excel(), read_json().
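
A minimal sketch of the read_* convention, assuming small in-memory strings instead of files on disk (csv_text and json_text are made-up sample data). Wrapping each string in io.StringIO lets the reader functions treat it like a file:

```python
import io
import pandas as pd

csv_text = "name,age\nAlice,30\nBob,25\n"
json_text = '[{"name": "Alice", "age": 30}, {"name": "Bob", "age": 25}]'

# Same data, two formats, two read_* functions
from_csv = pd.read_csv(io.StringIO(csv_text))
from_json = pd.read_json(io.StringIO(json_text))

print(from_csv)
print(from_json)
```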

In [12]:
# Loading csv(comma separated values) file
import pandas as pd
data1 = pd.read_csv("C:/Users/hp/Desktop/Customers_Data.csv",encoding='latin-1')
print(data1)

       CustomerKey Prefix  FirstName LastName   BirthDate MaritalStatus  \
0            11000    MR.        JON     YANG    4/8/1966             M   
1            11001    MR.     EUGENE    HUANG   5/14/1965             S   
2            11002    MR.      RUBEN   TORRES   8/12/1965             M   
3            11003    MS.    CHRISTY      ZHU   2/15/1968             S   
4            11004   MRS.  ELIZABETH  JOHNSON    8/8/1968             S   
...            ...    ...        ...      ...         ...           ...   
18143        29479    MR.      TOMMY     TANG    7/4/1958             M   
18144        29480   MRS.       NINA     RAJI  11/10/1960             S   
18145        29481    MR.       IVAN     SURI    1/5/1960             S   
18146        29482    MR.    CLAYTON    ZHANG    3/5/1959             M   
18147        29483    MR.      JÉSUS  NAVARRO   12/8/1959             M   

      Gender                    EmailAddress AnnualIncome  TotalChildren  \
0          M       jon2

In [63]:
# Loading Excel file
data = pd.read_excel("C:/Users/hp/Desktop/Data Analyst/Excel FIle/ESD.xlsx")
print(data)

       EEID        Full Name                 Job Title  Department  \
0    E02387      Emily Davis                Sr. Manger          IT   
1    E04105    Theodore Dinh       Technical Architect          IT   
2    E02572     Luna Sanders                  Director     Finance   
3    E02832  Penelope Jordan  Computer Systems Manager          IT   
4    E01639        Austin Vo               Sr. Analyst     Finance   
..      ...              ...                       ...         ...   
995  E03094     Wesley Young               Sr. Analyst   Marketing   
996  E01909     Lillian Khan                   Analyst     Finance   
997  E04398      Oliver Yang                  Director   Marketing   
998  E02521      Lily Nguyen               Sr. Analyst     Finance   
999  E03545      Sofia Cheng            Vice President  Accounting   

              Business Unit  Gender  Ethnicity  Age  Hire Date  Annual Salary  \
0    Research & Development  Female      Black   55 2016-04-08         141604 

**to_string()**: This method returns the entire DataFrame as a string, so printing it shows every row and column.

In [64]:
print(data.to_string())

       EEID             Full Name                       Job Title       Department           Business Unit  Gender  Ethnicity  Age  Hire Date  Annual Salary  Bonus %        Country            City  Exit Date
0    E02387           Emily Davis                      Sr. Manger               IT  Research & Development  Female      Black   55 2016-04-08         141604     0.15  United States         Seattle 2021-10-16
1    E04105         Theodore Dinh             Technical Architect               IT           Manufacturing    Male      Asian   59 1997-11-29          99975     0.00          China       Chongqing        NaT
2    E02572          Luna Sanders                        Director          Finance     Speciality Products  Female  Caucasian   50 2006-10-26         163099     0.20  United States         Chicago        NaT
3    E02832       Penelope Jordan        Computer Systems Manager               IT           Manufacturing  Female  Caucasian   26 2019-09-27          84913     0.07  U

The maximum number of rows pandas displays before truncating the output is controlled by **options.display.max_rows**.

In [35]:
print(pd.options.display.max_rows)

60


In [36]:
pd.options.display.max_rows = 999   # Increase the maximum no. of rows so the entire data is displayed
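
Assigning to pd.options changes the setting for the rest of the session. If the larger limit is only needed for one cell, a possible alternative is pd.option_context, which restores the previous value when the with block ends. A short sketch:

```python
import pandas as pd

before = pd.get_option('display.max_rows')

# The new limit applies only inside the with block
with pd.option_context('display.max_rows', 999):
    inside = pd.get_option('display.max_rows')

after = pd.get_option('display.max_rows')
print(before, inside, after)
```

pd.reset_option('display.max_rows') can be used to return to the pandas default after a permanent change.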

## Exploring Data in Pandas

1. head()
2. tail()
3. info()
4. describe()
5. shape
6. dtypes

**head()**: It returns the column headers and the specified number of rows, starting from the top.

_Note_: If the number of rows is not specified, the head() method returns the top 5 rows.

In [65]:
data.head()

,EEID,Full Name,Job Title,Department,Business Unit,Gender,Ethnicity,Age,Hire Date,Annual Salary,Bonus %,Country,City,Exit Date
0,E02387,Emily Davis,Sr. Manger,IT,Research & Development,Female,Black,55,2016-04-08,141604,0.15,United States,Seattle,2021-10-16
1,E04105,Theodore Dinh,Technical Architect,IT,Manufacturing,Male,Asian,59,1997-11-29,99975,0.0,China,Chongqing,NaT
2,E02572,Luna Sanders,Director,Finance,Speciality Products,Female,Caucasian,50,2006-10-26,163099,0.2,United States,Chicago,NaT
3,E02832,Penelope Jordan,Computer Systems Manager,IT,Manufacturing,Female,Caucasian,26,2019-09-27,84913,0.07,United States,Chicago,NaT
4,E01639,Austin Vo,Sr. Analyst,Finance,Manufacturing,Male,Asian,55,1995-11-20,95409,0.0,United States,Phoenix,NaT


**tail()**: It returns the column headers and the specified number of rows, starting from the bottom.

_Note_: If the number of rows is not specified, the tail() method returns the bottom 5 rows.

In [66]:
data.tail()

,EEID,Full Name,Job Title,Department,Business Unit,Gender,Ethnicity,Age,Hire Date,Annual Salary,Bonus %,Country,City,Exit Date
995,E03094,Wesley Young,Sr. Analyst,Marketing,Speciality Products,Male,Caucasian,33,2016-09-18,98427,0.0,United States,Columbus,NaT
996,E01909,Lillian Khan,Analyst,Finance,Speciality Products,Female,Asian,44,2010-05-31,47387,0.0,China,Chengdu,2018-01-08
997,E04398,Oliver Yang,Director,Marketing,Speciality Products,Male,Asian,31,2019-06-10,176710,0.15,United States,Miami,NaT
998,E02521,Lily Nguyen,Sr. Analyst,Finance,Speciality Products,Female,Asian,33,2012-01-28,95960,0.0,China,Chengdu,NaT
999,E03545,Sofia Cheng,Vice President,Accounting,Corporate,Female,Asian,63,2020-07-26,216195,0.31,United States,Miami,NaT
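
Both cells above rely on the default of 5 rows. The sketch below passes an explicit row count instead, using a small made-up DataFrame rather than the employee data:

```python
import pandas as pd

df = pd.DataFrame({'n': range(10)})

first_three = df.head(3)   # rows with n = 0, 1, 2
last_two = df.tail(2)      # rows with n = 8, 9

print(first_three)
print(last_two)
```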


**info()**: It prints a summary that includes the data type of each column and the number of non-null values it contains.

In [67]:
data.info()         #Summary of the dataFrame

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 14 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   EEID           1000 non-null   object        
 1   Full Name      1000 non-null   object        
 2   Job Title      1000 non-null   object        
 3   Department     1000 non-null   object        
 4   Business Unit  1000 non-null   object        
 5   Gender         1000 non-null   object        
 6   Ethnicity      1000 non-null   object        
 7   Age            1000 non-null   int64         
 8   Hire Date      1000 non-null   datetime64[ns]
 9   Annual Salary  1000 non-null   int64         
 10  Bonus %        1000 non-null   float64       
 11  Country        1000 non-null   object        
 12  City           1000 non-null   object        
 13  Exit Date      85 non-null     datetime64[ns]
dtypes: datetime64[ns](2), float64(1), int64(2), object(9)
memory usage: 109.5

**describe()**: This method provides a quick statistical overview of the numeric columns in a DataFrame. As the output below shows, datetime columns are summarized as well.

In [68]:
data.describe()         # Statistics Summary

,Age,Hire Date,Annual Salary,Bonus %,Exit Date
count,1000.0,1000,1000.0,1000.0,85
mean,44.382,2012-04-07 02:54:14.400000,113217.365,0.08866,2016-11-02 18:04:14.117647104
min,25.0,1992-01-09 00:00:00,40063.0,0.0,1994-12-18 00:00:00
25%,35.0,2007-02-14 00:00:00,71430.25,0.0,2014-12-25 00:00:00
50%,45.0,2014-02-15 12:00:00,96557.0,0.0,2019-05-23 00:00:00
75%,54.0,2018-06-22 00:00:00,150782.25,0.15,2021-04-09 00:00:00
max,65.0,2021-12-26 00:00:00,258498.0,0.4,2022-08-17 00:00:00
std,11.246981,,53545.985644,0.117856,
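
Text columns such as Department are left out of the summary above. As a sketch on a small hypothetical DataFrame, passing include='all' asks describe() to report on every column, adding unique, top and freq rows for the non-numeric ones:

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['IT', 'IT', 'Finance'],
    'Age': [55, 59, 50],
})

numeric_only = df.describe()             # only the Age column
everything = df.describe(include='all')  # Department and Age

print(numeric_only)
print(everything)
```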


**dtypes:** An attribute of a DataFrame or Series, used to check data types of columns.

No brackets: Since it's an attribute, brackets are not required when accessing it.

In [69]:
# dtypes is an attribute, so no parentheses are needed
data.dtypes            # Checking column data types

EEID                     object
Full Name                object
Job Title                object
Department               object
Business Unit            object
Gender                   object
Ethnicity                object
Age                       int64
Hire Date        datetime64[ns]
Annual Salary             int64
Bonus %                 float64
Country                  object
City                     object
Exit Date        datetime64[ns]
dtype: object

**shape:** Returns a tuple representing the dimensions of the DataFrame (rows, columns).

In [70]:
data.shape            # Display the number of rows and columns in the DataFrame

(1000, 14)

### How to save data in a spreadsheet.

to_* Methods: The to_* methods in pandas facilitate the export of data from a DataFrame or Series to various file formats.

to_excel() Method: Specifically, the to_excel() method is used to save data as an Excel file.

Customizations:

sheet_name Parameter:
This parameter allows users to define a custom name for the worksheet (e.g., sheet_name ='passengers'), overriding the default name of Sheet1.

index Parameter:
Setting (index=False) ensures that the row index labels are not included in the saved spreadsheet.

In [15]:
data.to_excel('ESD.xlsx', sheet_name = 'Details', index = False)   # Save the data into excel file
print('Successfully Saved')

Successfully Saved
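
The effect of index=False is easier to see in plain text. The sketch below swaps in to_csv() for to_excel() (an assumption made so the example needs no Excel engine); called without a path, to_csv() returns the CSV as a string. The DataFrame is made-up sample data:

```python
import io
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Marks': [28, 48]})

with_index = df.to_csv()                # row labels written as a first column
without_index = df.to_csv(index=False)  # row labels left out

print(with_index.splitlines()[0])       # ,Name,Marks
print(without_index.splitlines()[0])    # Name,Marks

# Reading the first version back produces an extra "Unnamed: 0" column
reloaded = pd.read_csv(io.StringIO(with_index))
print(list(reloaded.columns))           # ['Unnamed: 0', 'Name', 'Marks']
```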


### REMEMBER

Getting data into pandas from many different file formats or data sources is supported by the read_* functions.

Exporting data out of pandas is provided by the different to_* methods.

The head/tail/info methods and the dtypes attribute are convenient for a first check.

_______________________________________________________________________________________________________________________________________________________

## ATTRIBUTES in Pandas

Attributes are characteristics or properties of a pandas object that return a value or information about that object without performing any actions. They don't need parentheses.

**Example:** .shape, .columns, .dtype
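
A brief sketch of the difference between an attribute and a method, using a small made-up DataFrame. The attribute is read without parentheses, the method is called with them, and calling an attribute as if it were a method raises a TypeError:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

print(df.shape)           # attribute: (3, 2)
print(df.head(2).shape)   # head() is a method: (2, 2)

try:
    df.shape()            # a tuple cannot be called
except TypeError as err:
    print('TypeError:', err)
```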

**_Series Attribute:_**
1. .index: Returns the index (labels) of the Series.
2. .values: Returns the underlying data as a NumPy array.
3. .dtype: Returns the data type of the elements in the Series.
4. .name: Returns or sets the name of the Series.

In [71]:
#1. .index
ser = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(ser.index)

#2. .values
ser = pd.Series([10, 20, 30])
print(ser.values)

#3. .dtype
ser = pd.Series([10, 20, 30])
print(ser.dtype)

#4. .name
ser = pd.Series([10, 20, 30], name='my_series')
print(ser.name)

Index(['a', 'b', 'c'], dtype='object')
[10 20 30]
int64
my_series


**_DataFrame Attributes_**

1. .columns: Returns the labels of columns in the DataFrame.
2. .shape: Returns a tuple representing the dimensions of the DataFrame (rows, columns).
3. .dtypes: Returns the data types of each column in the DataFrame.
4. .size: Returns the number of elements (rows × columns) in the DataFrame.
5. .ndim: Returns the number of dimensions of the DataFrame (always 2 for a DataFrame).
6. .empty: Returns True if the DataFrame is empty, otherwise False.

In [72]:
import pandas as pd

#1. .columns
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
print(df.columns)

#2. .shape
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
print(df.shape)

#3. .dtypes
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4.5, 5.5, 6.5]})
print(df.dtypes)

#4. .size
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
print(df.size)

#5. .ndim
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
print(df.ndim)

#6. .empty
df = pd.DataFrame({})
print(df.empty)

Index(['A', 'B'], dtype='object')
(3, 2)
A      int64
B    float64
dtype: object
6
2
True


## How do I select specific columns from a DataFrame?


Selecting a Column: To select a single column from a DataFrame, use square brackets  [] along with the column name.

In [73]:
# Selecting Single column
data['Full Name']

0          Emily Davis
1        Theodore Dinh
2         Luna Sanders
3      Penelope Jordan
4            Austin Vo
            ...       
995       Wesley Young
996       Lillian Khan
997        Oliver Yang
998        Lily Nguyen
999        Sofia Cheng
Name: Full Name, Length: 1000, dtype: object

__Returned Object:__ 
Each column in a DataFrame is represented as a Series. Therefore, when a single column is selected, the output is a pandas Series.

#### Selecting Multiple Columns:

To select multiple columns from a DataFrame, use a list of column names inside brackets [].

Bracket Explanation:
The inner brackets create a Python list of the column names, while the outer brackets select those columns from the DataFrame.

In [74]:
# Selecting Multiple Columns using []
data[['EEID','Full Name']]

,EEID,Full Name
0,E02387,Emily Davis
1,E04105,Theodore Dinh
2,E02572,Luna Sanders
3,E02832,Penelope Jordan
4,E01639,Austin Vo
...,...,...
995,E03094,Wesley Young
996,E01909,Lillian Khan
997,E04398,Oliver Yang
998,E02521,Lily Nguyen


In [75]:
data[['EEID','Full Name']].shape

# The selection returned a DataFrame with 1000 rows and 2 columns. Remember, a DataFrame is 2-dimensional with both a row and column dimension.

(1000, 2)
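
The type of the result depends on the brackets used. This sketch, on a small hypothetical DataFrame, compares selecting one column by name with selecting the same column through a one-element list:

```python
import pandas as pd

df = pd.DataFrame({
    'EEID': ['E1', 'E2'],
    'Full Name': ['A B', 'C D'],
    'Age': [30, 40],
})

single = df['Full Name']       # one label  -> Series
as_frame = df[['Full Name']]   # list of labels -> DataFrame

print(type(single).__name__, single.shape)       # Series (2,)
print(type(as_frame).__name__, as_frame.shape)   # DataFrame (2, 1)
```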

## How do I filter specific rows from a DataFrame?

In [76]:
# Create a dataFrame

emp = {
    'Name': ['Alice','Bob','charlie','Dev'],
    'Age': [20, 18, 22, 23],
    'Marks': [28, 48, 20, 35]
}

df = pd.DataFrame(emp)
df

,Name,Age,Marks
0,Alice,20,28
1,Bob,18,48
2,charlie,22,20
3,Dev,23,35


In [77]:
# Filtering rows where Marks is greater than 30:

df_filtered = df[df['Marks']>30]
print(df_filtered)

  Name  Age  Marks
1  Bob   18     48
3  Dev   23     35


_Note:_ When using multiple conditions in filtering with pandas, you need to enclose each condition in parentheses **()** and use logical operators like **& (AND), | (OR)**.

In [78]:
#  Filtering using multiple conditions:

df_filtered1 = df[(df['Age']>=20) & (df['Name']=='Alice')]
print(df_filtered1)

    Name  Age  Marks
0  Alice   20     28
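
The same pattern extends to the other logical operators. The sketch below rebuilds the small table from above and shows | (OR), ~ (NOT) and isin(), which tests membership in a list of values:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'charlie', 'Dev'],
    'Age': [20, 18, 22, 23],
    'Marks': [28, 48, 20, 35],
})

either = df[(df['Age'] < 20) | (df['Marks'] < 25)]   # Bob, charlie
not_above_30 = df[~(df['Marks'] > 30)]               # Alice, charlie
listed = df[df['Name'].isin(['Alice', 'Dev'])]       # Alice, Dev

print(either)
print(not_above_30)
print(listed)
```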


**loc[ ]:** is used for label-based indexing in a DataFrame, allowing access to rows and columns by their labels.

Syntax: df.loc[row_labels, column_labels]

#### Selecting Multiple Columns with loc:

To select multiple columns (for example 'Name' and 'Marks'), pass a list of column labels as the second argument. The example below selects both columns for the row labeled 2.

In [79]:
res = df.loc[2,['Name','Marks']]
res

Name     charlie
Marks         20
Name: 2, dtype: object

#### Selecting Rows:

Use a single label, a list of labels, or a slice of labels (e.g., df.loc[1] selects the row labeled 1, which is the second row under the default index).

In [80]:
df.loc[2]

Name     charlie
Age           22
Marks         20
Name: 2, dtype: object

#### Combining Rows and Columns:

Access specific rows and columns together (e.g., df.loc[0:2, ['column1', 'column2']]). Unlike ordinary Python slices, a label slice in loc includes both the start and the end label.

In [81]:
df.loc[1:3,['Name','Age']]

,Name,Age
1,Bob,18
2,charlie,22
3,Dev,23
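
loc works with labels and iloc with integer positions, and the two treat slices differently. A short sketch on the same small table:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'charlie', 'Dev'],
    'Age': [20, 18, 22, 23],
    'Marks': [28, 48, 20, 35],
})

by_label = df.loc[1:3, ['Name', 'Age']]   # labels 1, 2 and 3 (end included)
by_position = df.iloc[1:3, [0, 1]]        # positions 1 and 2 (end excluded)

print(len(by_label))      # 3
print(len(by_position))   # 2
```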


#### Selecting Columns:

Specify column labels (e.g., df.loc[:, 'column_name'] for all rows of a specific column).

In [82]:
df.loc[:,['Name']]             # All rows of the Name column; the list keeps the result a DataFrame

,Name
0,Alice
1,Bob
2,charlie
3,Dev


#### Boolean Indexing:

Filter rows based on conditions (e.g., df.loc[df['column_name'] > value]).

In [83]:
df.loc[df['Marks']>=30]

,Name,Age,Marks
1,Bob,18,48
3,Dev,23,35
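
A condition can also be combined with a column selection in a single loc call. A minimal sketch on the same table, where the condition picks the rows and the second argument picks the columns:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'charlie', 'Dev'],
    'Age': [20, 18, 22, 23],
    'Marks': [28, 48, 20, 35],
})

# Rows where Marks >= 30, restricted to two columns
high = df.loc[df['Marks'] >= 30, ['Name', 'Marks']]

# A single column label instead of a list returns a Series
high_names = df.loc[df['Marks'] >= 30, 'Name']

print(high)
print(high_names.tolist())   # ['Bob', 'Dev']
```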


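#### Assigning New Values with loc/iloc:

When selecting specific rows and/or columns with loc or iloc, new values can be assigned to the selected data.

For example, to assign the string 'Wrong Data' to the rows at positions 1 and 2 of the second column (Age). Because Age is an integer column, pandas 2.1 and later emit a FutureWarning about setting a value of an incompatible dtype when this cell runs: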

In [84]:
df.iloc[1:3, 1] = 'Wrong Data'
df


,Name,Age,Marks
0,Alice,20,28
1,Bob,Wrong Data,48
2,charlie,Wrong Data,20
3,Dev,23,35
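
Assignments that keep the column's data type avoid mixing strings into a numeric column. The sketch below is one such variation on the same table; the replacement values 30 and 21 are chosen only for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'charlie', 'Dev'],
    'Age': [20, 18, 22, 23],
    'Marks': [28, 48, 20, 35],
})

# Label/condition based: raise every mark below 30 up to 30
df.loc[df['Marks'] < 30, 'Marks'] = 30

# Position based: first row, second column (Age)
df.iloc[0, 1] = 21

print(df)
```

Both columns stay integer columns after the assignment.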


### REMEMBER

- When selecting subsets of data, square brackets [] are used.

- Inside these brackets, you can use a single column/row label, a list of column/row labels, a slice of labels, a conditional expression or a colon.

- Select specific rows and/or columns using loc when using the row and column names.

- Select specific rows and/or columns using iloc when using the positions in the table.

- You can assign new values to a selection based on loc/iloc.

________________________________________________________________________________________________________________________________________________________