<a href="https://colab.research.google.com/github/asheta66/Data-Science/blob/main/Practical_on_Pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

***What is the Pandas Library?***

Pandas is an open-source Python library that provides data structures and analysis tools for working with structured data. It is widely used in data science, data analysis, and machine learning workflows because it is both easy to use and powerful. Pandas is best suited to tabular data (like a spreadsheet or an SQL table) and offers operations similar to those found in SQL or Excel, such as filtering, joining, and aggregation.

**Key features of the Pandas library include:**



1. **DataFrame:** The core data structure in Pandas is the DataFrame, which is a two-dimensional tabular data structure with labeled axes (rows and columns). It allows you to store and manipulate data in a structured and flexible manner.



In [None]:
import pandas as pd

# create a data frame

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 17, 28],
    'City': ['New York', 'San Francisco', 'Los Anglos', 'Alexandria'],
    'Salary': [100, 150, 200, 300]
}

df = pd.DataFrame(data)
df

Unnamed: 0,Name,Age,City,Salary
0,Alice,25,New York,100
1,Bob,30,San Francisco,150
2,Charlie,17,Los Anglos,200
3,David,28,Alexandria,300


In [None]:
df[['Name','Salary']]

Unnamed: 0,Name,Salary
0,Alice,100
1,Bob,150
2,Charlie,200
3,David,300


In [None]:
df.iloc[1]

Name                Bob
Age                  30
City      San Francisco
Salary              150
Name: 1, dtype: object
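
The cell above selects a row by its integer position with `.iloc`. As a minimal sketch of the difference between position-based and label-based selection, the example below rebuilds a smaller version of the same toy data and indexes it by `Name` (the variable name `people` and the choice of index are assumptions made for illustration):

```python
import pandas as pd

# Same toy data as above, but indexed by name instead of 0..3.
people = pd.DataFrame(
    {
        "Name": ["Alice", "Bob", "Charlie", "David"],
        "Age": [25, 30, 17, 28],
        "Salary": [100, 150, 200, 300],
    }
).set_index("Name")

# .iloc selects by integer position: the second row.
by_position = people.iloc[1]

# .loc selects by index label: the row labeled "Bob".
by_label = people.loc["Bob"]

# Both selections return the same row as a Series.
print(by_position.equals(by_label))

# .loc also accepts a column label to pull out a single value.
print(people.loc["Charlie", "Age"])
```

With the default `RangeIndex`, positions and labels happen to coincide, which is why `df.iloc[1]` and `df.loc[1]` return the same row in the original DataFrame.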

In [None]:
# compute the mean of the Age column
df['Age'].mean()

25.0

In [None]:
print("The minimum age is", df['Age'].min())
print("The maximum age is", df['Age'].max())
print("The mean age is", df['Age'].mean())


The minimum age is 17
The maximum age is 30
The mean age is 25.0



2. **Series:** A Series is a one-dimensional labeled array that can hold values of any data type (integers, floats, strings, Python objects). Series are the building blocks of DataFrames: each column of a DataFrame is a Series.

In [None]:
ndata = [10, 20, 30, 40, 50]
s = pd.Series(ndata)

print(s.min())
print(s.max())
print(s.mean())

10
50
30.0
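
To illustrate how Series act as the building blocks of a DataFrame, here is a small sketch that gives two Series explicit labels and combines them. The names `ages`, `cities`, and `frame` are made up for this example:

```python
import pandas as pd

labels = ["Alice", "Bob", "Charlie"]

# Two Series with explicit labels instead of the default 0..n-1 index.
ages = pd.Series([25, 30, 17], index=labels, name="Age")
cities = pd.Series(["New York", "San Francisco", "Alexandria"], index=labels, name="City")

# Combining Series that share an index yields a DataFrame,
# with one column per Series (named after each Series).
frame = pd.concat([ages, cities], axis=1)
print(frame)

# Selecting a single column gives back a Series.
print(type(frame["Age"]).__name__)
```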


3. **Data Manipulation:** Pandas provides a wide range of functions for data cleaning, transformation, filtering, and aggregation. You can perform tasks like filtering rows, merging data, handling missing values, and more.


In [None]:
df


Unnamed: 0,Name,Age,City,Salary
0,Alice,25,New York,100
1,Bob,30,San Francisco,150
2,Charlie,17,Los Anglos,200
3,David,28,Alexandria,300


In [None]:
df.head(2)

Unnamed: 0,Name,Age,City,Salary
0,Alice,25,New York,100
1,Bob,30,San Francisco,150


In [None]:
df.tail(2)

Unnamed: 0,Name,Age,City,Salary
2,Charlie,17,Los Anglos,200
3,David,28,Alexandria,300


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    4 non-null      object
 1   Age     4 non-null      int64 
 2   City    4 non-null      object
 3   Salary  4 non-null      int64 
dtypes: int64(2), object(2)
memory usage: 256.0+ bytes


In [None]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Age,4.0,25.0,5.715476,17.0,23.0,26.5,28.5,30.0
Salary,4.0,187.5,85.391256,100.0,137.5,175.0,225.0,300.0


In [None]:
merged_df = pd.merge(df, df, on='Name')
merged_df

Unnamed: 0,Name,Age_x,City_x,Salary_x,Age_y,City_y,Salary_y
0,Alice,25,New York,100,25,New York,100
1,Bob,30,San Francisco,150,30,San Francisco,150
2,Charlie,17,Los Anglos,200,17,Los Anglos,200
3,David,28,Alexandria,300,28,Alexandria,300
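
Merging a DataFrame with itself only shows the `_x` and `_y` suffixes. A more typical use joins two different tables on a shared key. The sketch below assumes a hypothetical `departments` table that is not part of the original notebook, and compares an inner join with a left join:

```python
import pandas as pd

employees = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Charlie"], "Age": [25, 30, 17]}
)

# Hypothetical second table; "Eve" has no match in employees,
# and "Charlie" has no match here.
departments = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Eve"], "Dept": ["HR", "IT", "Ops"]}
)

# Inner join: keep only names present in both tables.
inner = pd.merge(employees, departments, on="Name", how="inner")

# Left join: keep every employee; unmatched rows get NaN in Dept.
left = pd.merge(employees, departments, on="Name", how="left")

print(inner)
print(left)
```

The `how` argument controls which keys survive the join; `"inner"` is the default used by `pd.merge`.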


In [None]:
# apply a function to every value in a column
df['New_Salary'] = df['Salary'].apply(lambda x: x * 1.2)
df

Unnamed: 0,Name,Age,City,Salary,New_Salary
0,Alice,25,New York,100,120.0
1,Bob,30,San Francisco,150,180.0
2,Charlie,17,Los Anglos,200,240.0
3,David,28,Alexandria,300,360.0


In [None]:
# sorting the data based on a column

sorted_df = df.sort_values(by='Age')
sorted_df

Unnamed: 0,Name,Age,City,Salary,New_Salary
2,Charlie,17,Los Anglos,200,240.0
0,Alice,25,New York,100,120.0
3,David,28,Alexandria,300,360.0
1,Bob,30,San Francisco,150,180.0
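
`sort_values` also supports descending order and sorting by several columns at once. The following sketch uses a small made-up `staff` table (the `Dept` column is an assumption added for illustration):

```python
import pandas as pd

staff = pd.DataFrame(
    {
        "Name": ["Alice", "Bob", "Charlie", "David"],
        "Dept": ["HR", "IT", "HR", "IT"],
        "Salary": [100, 150, 200, 300],
    }
)

# Highest salary first.
by_salary = staff.sort_values(by="Salary", ascending=False)

# Sort by department (A to Z), then by salary (high to low) within each
# department. reset_index(drop=True) renumbers the rows 0..n-1.
by_dept = staff.sort_values(
    by=["Dept", "Salary"], ascending=[True, False]
).reset_index(drop=True)

print(by_salary)
print(by_dept)
```

Note that `sort_values` returns a new DataFrame and keeps the original row labels unless the index is reset, which is why the sorted output above still shows the labels 2, 0, 3, 1.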


In [None]:
# create a new column based on a certain condition
# (the comparison is vectorized and already returns True/False per row)
df['is_adult'] = df['Age'] >= 18
df


Unnamed: 0,Name,Age,City,Salary,New_Salary,is_adult
0,Alice,25,New York,100,120.0,True
1,Bob,30,San Francisco,150,180.0,True
2,Charlie,17,Los Anglos,200,240.0,False
3,David,28,Alexandria,300,360.0,True
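
Filtering rows and handling missing values are mentioned above but not demonstrated. Here is a minimal sketch of both, using a made-up `staff` table in which one salary is deliberately set to `NaN` (the data and variable names are assumptions for illustration):

```python
import numpy as np
import pandas as pd

staff = pd.DataFrame(
    {
        "Name": ["Alice", "Bob", "Charlie", "David"],
        "Age": [25, 30, 17, 28],
        "Salary": [100.0, np.nan, 200.0, 300.0],  # Bob's salary is missing
    }
)

# Boolean filtering: keep only the rows where the condition is True.
adults = staff[staff["Age"] >= 18]

# Count the missing values in each column.
missing_per_column = staff.isna().sum()

# Option 1: fill the missing salary with the column median.
filled = staff.fillna({"Salary": staff["Salary"].median()})

# Option 2: drop the rows where the salary is missing.
dropped = staff.dropna(subset=["Salary"])

print(adults)
print(missing_per_column)
print(filled)
print(dropped)
```

Whether to fill or drop depends on the analysis; filling keeps the row count intact, while dropping avoids inventing values.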



4. **Data Loading and Saving:** Pandas supports reading and writing data from various file formats such as CSV, Excel, SQL databases, and more. This makes it easy to import data from different sources.

In [None]:
df1 = pd.read_csv('diabetes.csv')

df1

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
...,...,...,...,...,...,...,...,...,...
763,10,101,76,48,180,32.9,0.171,63,0
764,2,122,70,27,0,36.8,0.340,27,0
765,5,121,72,23,112,26.2,0.245,30,0
766,1,126,60,0,0,30.1,0.349,47,1


In [None]:
df1['Outcome'].unique()

array([1, 0])

In [None]:
# compute the pairwise correlation matrix of the numeric columns

correlation_matrix = df1.corr()
correlation_matrix

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
Pregnancies,1.0,0.129459,0.141282,-0.081672,-0.073535,0.017683,-0.033523,0.544341,0.221898
Glucose,0.129459,1.0,0.15259,0.057328,0.331357,0.221071,0.137337,0.263514,0.466581
BloodPressure,0.141282,0.15259,1.0,0.207371,0.088933,0.281805,0.041265,0.239528,0.065068
SkinThickness,-0.081672,0.057328,0.207371,1.0,0.436783,0.392573,0.183928,-0.11397,0.074752
Insulin,-0.073535,0.331357,0.088933,0.436783,1.0,0.197859,0.185071,-0.042163,0.130548
BMI,0.017683,0.221071,0.281805,0.392573,0.197859,1.0,0.140647,0.036242,0.292695
DiabetesPedigreeFunction,-0.033523,0.137337,0.041265,0.183928,0.185071,0.140647,1.0,0.033561,0.173844
Age,0.544341,0.263514,0.239528,-0.11397,-0.042163,0.036242,0.033561,1.0,0.238356
Outcome,0.221898,0.466581,0.065068,0.074752,0.130548,0.292695,0.173844,0.238356,1.0
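
`corr()` works directly on the diabetes data because every column is numeric. For a frame that also contains text columns, recent pandas versions need to be told to skip them. The sketch below assumes pandas 1.5 or later, where `DataFrame.corr` accepts `numeric_only`, and reuses the small toy data from earlier under the hypothetical name `people`:

```python
import pandas as pd

people = pd.DataFrame(
    {
        "Name": ["Alice", "Bob", "Charlie", "David"],
        "Age": [25, 30, 17, 28],
        "Salary": [100, 150, 200, 300],
    }
)

# numeric_only=True leaves the string column "Name" out of the computation.
corr = people.corr(numeric_only=True)
print(corr)

# Rank the other numeric columns by their correlation with Salary.
ranked = corr["Salary"].drop("Salary").sort_values(ascending=False)
print(ranked)
```

The same ranking idea applies to the diabetes data: `correlation_matrix['Outcome']` holds the correlation of each feature with the target column.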


In [None]:
df1 = df1.rename(columns={'DiabetesPedigreeFunction': 'DPF'})
df1.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DPF,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
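
The cells above only read data. As a sketch of the saving side, the example below writes a small made-up DataFrame to a CSV file in a temporary directory and reads it back (the file name `scores.csv` and the data are assumptions for illustration):

```python
import os
import tempfile

import pandas as pd

scores = pd.DataFrame({"Name": ["Alice", "Bob"], "Score": [90, 85]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "scores.csv")

    # index=False keeps the row index out of the file. Without it,
    # reading the file back produces an extra "Unnamed: 0" column.
    scores.to_csv(path, index=False)

    restored = pd.read_csv(path)

print(restored)
```

Other formats follow the same pattern, for example `to_excel` / `read_excel` and `to_sql` / `read_sql`, although those need additional packages or a database connection.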


5. **Data Analysis:** With Pandas, you can perform exploratory data analysis (EDA), compute descriptive statistics, calculate correlations, and create pivot tables.
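
Pivot tables are listed here but not shown elsewhere in this notebook. A minimal sketch, using a made-up `sales` table (all names and values are assumptions for illustration):

```python
import pandas as pd

sales = pd.DataFrame(
    {
        "Region": ["East", "East", "West", "West"],
        "Product": ["A", "B", "A", "B"],
        "Revenue": [10, 20, 30, 40],
    }
)

# One row per region, one column per product, summed revenue in the cells.
pivot = sales.pivot_table(
    index="Region", columns="Product", values="Revenue", aggfunc="sum"
)
print(pivot)

# groupby gives the same kind of summary along a single dimension.
by_region = sales.groupby("Region")["Revenue"].mean()
print(by_region)
```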

**Exploratory Data Analysis (EDA):**

> Exploratory Data Analysis involves understanding the data, identifying patterns, and gaining insights before conducting more in-depth analyses.
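
As a rough first pass at EDA, the sketch below checks the size of the data, looks for missing values, counts the classes, and compares a feature across classes. To stay self-contained it uses only three columns from the first five rows of the diabetes data shown earlier, under the hypothetical name `records`:

```python
import pandas as pd

# First five rows of the diabetes data shown earlier (three columns only).
records = pd.DataFrame(
    {
        "Glucose": [148, 85, 183, 89, 137],
        "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
        "Outcome": [1, 0, 1, 0, 1],
    }
)

# How many rows and columns are there?
print(records.shape)

# Are any values missing?
n_missing = records.isna().sum()
print(n_missing)

# How balanced are the two outcome classes?
class_counts = records["Outcome"].value_counts()
print(class_counts)

# Does mean glucose differ between the classes?
group_means = records.groupby("Outcome")["Glucose"].mean()
print(group_means)
```

On the full dataset the same calls would be made on `df1`; five rows are far too few to draw conclusions from.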

