# Pandas

Pandas is an open-source Python library for data analysis and data manipulation.
It provides the data structures and functions needed to load, clean, transform, and summarize structured (tabular) data.

#### **Key Features of Pandas**
- **Data Structures**: Provides two primary data structures: Series and DataFrame.
- **Data Manipulation**: Easily manipulate, filter, and transform data.
- **Data Cleaning**: Handle missing data, duplicates, and outliers.
- **Input/Output**: Read and write data from various file formats (CSV, Excel, SQL, etc.).
- **Group Operations**: Efficiently group and aggregate data.
- **Time Series**: Support for time series data with built-in functionality for date and time manipulation.

#### **Core Data Structures**
- **Series**: A one-dimensional labeled array that can hold any data type (integers, strings, floats, etc.). It can be thought of as a column in a table.

- **DataFrame**: A two-dimensional labeled data structure with columns of potentially different types. It is similar to a spreadsheet or SQL table.

In [1]:
import pandas as pd

# Creating a Series
data = pd.Series([1, 2, 3, 4])
print(data)
print(type(data))

0    1
1    2
2    3
3    4
dtype: int64
<class 'pandas.core.series.Series'>


In [2]:
# Creating a DataFrame
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': ['a', 'b', 'c']
})
print(df)

   A  B
0  1  a
1  2  b
2  3  c


#### **Data Manipulation Techniques**
- **Indexing and Selecting Data**:
  - Use `.loc[]` for label-based indexing.
  - Use `.iloc[]` for position-based indexing.

- **Filtering Data**: Use boolean indexing to filter rows based on conditions.

- **Adding and Modifying Columns**: Assign to `df['name']` to add a new column or overwrite an existing one.

- **Handling Missing Data**:
  - Use `dropna()` to remove missing values.
  - Use `fillna()` to fill missing values with a specified value or method.
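
The two missing-data methods above can be compared side by side. This is a minimal sketch on a hypothetical toy frame (`toy` and its columns `x` and `y` are illustrative names, not part of the tutorial's data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value in each column
toy = pd.DataFrame({
    'x': [1.0, np.nan, 3.0],
    'y': ['a', 'b', None],
})

# dropna() keeps only the rows that have no missing values
dropped = toy.dropna()

# fillna() accepts a dict to use a different fill value per column
filled = toy.fillna({'x': 0.0, 'y': 'unknown'})

print(dropped)
print(filled)
```

Both calls return new DataFrames, so `toy` still contains its missing values afterwards.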

#### **Aggregation and Grouping**
- **Group By**: Use the `groupby()` method to group data and perform aggregate functions like `sum()`, `mean()`, etc.

#### **Input/Output Operations**
- **Reading Data**: Load data from various file formats.
  ```python
  df = pd.read_csv('data.csv')  # Read CSV file
  ```

- **Writing Data**: Save DataFrames to various formats.
  ```python
  df.to_excel('output.xlsx', index=False)  # Write DataFrame to Excel
  ```
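
To try reading and writing without touching the file system, the same calls can target an in-memory buffer. The sketch below assumes a small illustrative frame (`frame`) and uses `io.StringIO` in place of a file path:

```python
import io

import pandas as pd

frame = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c']})

# Write to an in-memory text buffer instead of a file on disk
buffer = io.StringIO()
frame.to_csv(buffer, index=False)  # index=False leaves out the row labels

buffer.seek(0)  # rewind the buffer before reading it back
restored = pd.read_csv(buffer)

print(restored)
```

Without `index=False`, the row labels would be written as an extra unnamed column and read back as data.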


In [4]:
# Selecting a column
print(df['A'])

# Selecting a row
print(df.loc[0])

0    1
1    2
2    3
Name: A, dtype: int64
A    1
B    a
Name: 0, dtype: object
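
With the default `RangeIndex`, labels and positions coincide, so `.loc[0]` and `.iloc[0]` return the same row. The difference shows up once the labels are not `0, 1, 2, ...`. A minimal sketch, assuming a hypothetical frame `labelled` with shuffled row labels:

```python
import pandas as pd

# Hypothetical frame whose row labels are not in positional order
labelled = pd.DataFrame(
    {'A': [10, 20, 30], 'B': ['a', 'b', 'c']},
    index=[2, 0, 1],
)

by_label = labelled.loc[0]      # the row whose label is 0 (second row)
by_position = labelled.iloc[0]  # the first row, whatever its label is

print(by_label['A'])     # 20
print(by_position['A'])  # 10
```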


In [5]:
# Keep only the rows where column A is greater than 1
filtered_df = df[df['A'] > 1]
print(filtered_df)

   A  B
1  2  b
2  3  c


In [6]:
df['C'] = df['A'] * 2  # Add a new column
df['A'] = df['A'] + 1  # Modify an existing column

In [7]:
# Replace any missing values with 0, modifying df in place
df.fillna(0, inplace=True)

In [8]:
grouped_df = df.groupby('B').sum()  # Group by column B and sum the values
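
In the small frame above every value of `B` is unique, so each group holds a single row. Grouping is easier to see when keys repeat. The following sketch uses a hypothetical `sales` frame (the column names `region` and `units` are assumptions for illustration):

```python
import pandas as pd

# Hypothetical data with repeated group keys
sales = pd.DataFrame({
    'region': ['north', 'south', 'north', 'south'],
    'units': [10, 5, 20, 15],
})

# One aggregate per group
totals = sales.groupby('region')['units'].sum()

# Several aggregates at once with agg()
summary = sales.groupby('region')['units'].agg(['sum', 'mean'])

print(totals)
print(summary)
```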

In [9]:
import pandas as pd
import numpy as np

In [14]:
from sklearn.datasets import load_diabetes
diabetes_dataset = load_diabetes()

In [15]:
type(diabetes_dataset)

sklearn.utils._bunch.Bunch

In [16]:
print(diabetes_dataset)

{'data': array([[ 0.03807591,  0.05068012,  0.06169621, ..., -0.00259226,
         0.01990749, -0.01764613],
       [-0.00188202, -0.04464164, -0.05147406, ..., -0.03949338,
        -0.06833155, -0.09220405],
       [ 0.08529891,  0.05068012,  0.04445121, ..., -0.00259226,
         0.00286131, -0.02593034],
       ...,
       [ 0.04170844,  0.05068012, -0.01590626, ..., -0.01107952,
        -0.04688253,  0.01549073],
       [-0.04547248, -0.04464164,  0.03906215, ...,  0.02655962,
         0.04452873, -0.02593034],
       [-0.04547248, -0.04464164, -0.0730303 , ..., -0.03949338,
        -0.00422151,  0.00306441]]), 'target': array([151.,  75., 141., 206., 135.,  97., 138.,  63., 110., 310., 101.,
        69., 179., 185., 118., 171., 166., 144.,  97., 168.,  68.,  49.,
        68., 245., 184., 202., 137.,  85., 131., 283., 129.,  59., 341.,
        87.,  65., 102., 265., 276., 252.,  90., 100.,  55.,  61.,  92.,
       259.,  53., 190., 142.,  75., 142., 155., 225.,  59., 104., 182.,
  

#### Inspecting a DataFrame

In [17]:
# create a pandas DataFrame from the feature matrix

diabetes_df = pd.DataFrame(diabetes_dataset.data, columns=diabetes_dataset.feature_names)

In [18]:
# first 5 rows of the DataFrame

diabetes_df.head()

| | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.038076 | 0.05068 | 0.061696 | 0.021872 | -0.044223 | -0.034821 | -0.043401 | -0.002592 | 0.019907 | -0.017646 |
| 1 | -0.001882 | -0.044642 | -0.051474 | -0.026328 | -0.008449 | -0.019163 | 0.074412 | -0.039493 | -0.068332 | -0.092204 |
| 2 | 0.085299 | 0.05068 | 0.044451 | -0.00567 | -0.045599 | -0.034194 | -0.032356 | -0.002592 | 0.002861 | -0.02593 |
| 3 | -0.089063 | -0.044642 | -0.011595 | -0.036656 | 0.012191 | 0.024991 | -0.036038 | 0.034309 | 0.022688 | -0.009362 |
| 4 | 0.005383 | -0.044642 | -0.036385 | 0.021872 | 0.003935 | 0.015596 | 0.008142 | -0.002592 | -0.031988 | -0.046641 |


In [19]:
# finding the number of rows & columns
diabetes_df.shape

(442, 10)

In [21]:
# last 5 rows of the DataFrame
diabetes_df.tail()

| | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 |
|---|---|---|---|---|---|---|---|---|---|---|
| 437 | 0.041708 | 0.05068 | 0.019662 | 0.059744 | -0.005697 | -0.002566 | -0.028674 | -0.002592 | 0.031193 | 0.007207 |
| 438 | -0.005515 | 0.05068 | -0.015906 | -0.067642 | 0.049341 | 0.079165 | -0.028674 | 0.034309 | -0.018114 | 0.044485 |
| 439 | 0.041708 | 0.05068 | -0.015906 | 0.017293 | -0.037344 | -0.01384 | -0.024993 | -0.01108 | -0.046883 | 0.015491 |
| 440 | -0.045472 | -0.044642 | 0.039062 | 0.001215 | 0.016318 | 0.015283 | -0.028674 | 0.02656 | 0.044529 | -0.02593 |
| 441 | -0.045472 | -0.044642 | -0.07303 | -0.081413 | 0.08374 | 0.027809 | 0.173816 | -0.039493 | -0.004222 | 0.003064 |


In [22]:
type(diabetes_df)

pandas.core.frame.DataFrame

In [23]:
# summary information about the DataFrame (index, columns, dtypes, memory usage)
diabetes_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 442 entries, 0 to 441
Data columns (total 10 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   age     442 non-null    float64
 1   sex     442 non-null    float64
 2   bmi     442 non-null    float64
 3   bp      442 non-null    float64
 4   s1      442 non-null    float64
 5   s2      442 non-null    float64
 6   s3      442 non-null    float64
 7   s4      442 non-null    float64
 8   s5      442 non-null    float64
 9   s6      442 non-null    float64
dtypes: float64(10)
memory usage: 34.7 KB


In [24]:
# finding the number of missing values
diabetes_df.isnull().sum()

age    0
sex    0
bmi    0
bp     0
s1     0
s2     0
s3     0
s4     0
s5     0
s6     0
dtype: int64

In [26]:
# count how many rows have each distinct value of 'sex'
diabetes_df.value_counts('sex')

sex
-0.044642    235
 0.050680    207
dtype: int64
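
Counts can also be expressed as proportions. A minimal sketch on a hypothetical Series (`labels` is an illustrative name), using the `normalize=True` option of `Series.value_counts()`:

```python
import pandas as pd

# Hypothetical categorical labels
labels = pd.Series(['m', 'f', 'f', 'm', 'f'])

counts = labels.value_counts()                # absolute frequencies
shares = labels.value_counts(normalize=True)  # fractions that sum to 1

print(counts)
print(shares)
```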

In [28]:
# group the rows by their 'bp' value and compute the mean of every other column per group
diabetes_df.groupby('bp').mean()

| bp | age | sex | bmi | s1 | s2 | s3 | s4 | s5 | s6 |
|---|---|---|---|---|---|---|---|---|---|
| -0.112399 | -0.027310 | -0.044642 | -0.066563 | -0.049727 | -0.041397 | 0.000779 | -0.039493 | -0.035816 | -0.009362 |
| -0.108956 | -0.099961 | -0.044642 | -0.067641 | -0.074494 | -0.072712 | 0.015505 | -0.039493 | -0.049872 | -0.009362 |
| -0.102070 | -0.049105 | -0.044642 | -0.064408 | -0.002945 | -0.015406 | 0.063367 | -0.047243 | -0.033246 | -0.054925 |
| -0.100934 | 0.001751 | -0.044642 | -0.039618 | -0.029088 | -0.030124 | 0.044958 | -0.050195 | -0.068332 | -0.129483 |
| -0.098627 | -0.020045 | -0.044642 | -0.046085 | -0.075870 | -0.059873 | -0.017629 | -0.039493 | -0.051404 | -0.046641 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 0.101058 | 0.074401 | -0.044642 | 0.031517 | 0.046589 | 0.036890 | 0.015505 | -0.002592 | 0.033654 | 0.044485 |
| 0.104501 | -0.007331 | 0.003019 | 0.027206 | 0.016318 | -0.015249 | 0.085456 | -0.039493 | 0.004890 | 0.044485 |
| 0.107944 | 0.027178 | 0.018906 | 0.002417 | 0.026409 | 0.008081 | 0.002006 | 0.006141 | 0.052089 | 0.065196 |
| 0.125158 | -0.001882 | -0.044642 | 0.033673 | 0.024574 | 0.026243 | -0.010266 | -0.002592 | 0.026717 | 0.061054 |
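
Because `bp` is a continuous column, grouping on its raw values tends to produce many small groups. One common alternative is to bin the column first and group by the bins. The sketch below is an illustration on made-up numbers (the frame `measurements` and the choice of three bins are assumptions, not taken from the dataset):

```python
import pandas as pd

# Hypothetical continuous measurements
measurements = pd.DataFrame({
    'bp': [-0.10, -0.05, 0.00, 0.04, 0.09, 0.12],
    'bmi': [-0.06, -0.02, 0.01, 0.03, 0.02, 0.04],
})

# Split 'bp' into three equal-width intervals
bp_bins = pd.cut(measurements['bp'], bins=3)

# Average 'bmi' within each interval; observed=True keeps only non-empty bins
binned_mean = measurements.groupby(bp_bins, observed=True)['bmi'].mean()

print(binned_mean)
```

`pd.qcut` is a related option when bins with roughly equal numbers of rows are preferred over equal-width bins.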


#### Summary Statistics
You can get a summary of the statistics for numerical columns using the `.describe()` method:

In [29]:
summary_stats = diabetes_df.describe()
print(summary_stats)

                age           sex           bmi            bp            s1  \
count  4.420000e+02  4.420000e+02  4.420000e+02  4.420000e+02  4.420000e+02   
mean  -2.511817e-19  1.230790e-17 -2.245564e-16 -4.797570e-17 -1.381499e-17   
std    4.761905e-02  4.761905e-02  4.761905e-02  4.761905e-02  4.761905e-02   
min   -1.072256e-01 -4.464164e-02 -9.027530e-02 -1.123988e-01 -1.267807e-01   
25%   -3.729927e-02 -4.464164e-02 -3.422907e-02 -3.665608e-02 -3.424784e-02   
50%    5.383060e-03 -4.464164e-02 -7.283766e-03 -5.670422e-03 -4.320866e-03   
75%    3.807591e-02  5.068012e-02  3.124802e-02  3.564379e-02  2.835801e-02   
max    1.107267e-01  5.068012e-02  1.705552e-01  1.320436e-01  1.539137e-01   

                 s2            s3            s4            s5            s6  
count  4.420000e+02  4.420000e+02  4.420000e+02  4.420000e+02  4.420000e+02  
mean   3.918434e-17 -5.777179e-18 -9.042540e-18  9.293722e-17  1.130318e-17  
std    4.761905e-02  4.761905e-02  4.761905e-02  4.761

In [31]:
mean_value = diabetes_df['age'].mean()
print(mean_value)

-2.511816797794472e-19


In [32]:
median_value = diabetes_df['age'].median()
print(median_value)

0.005383060374248237


In [34]:
std_dev = diabetes_df['age'].std()
print(std_dev)

0.047619047619047644
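
Note that pandas computes the sample standard deviation by default (`ddof=1`, dividing by `n - 1`). Passing `ddof=0` gives the population standard deviation instead. A small sketch on a hypothetical Series (`values` is an illustrative name):

```python
import pandas as pd

values = pd.Series([1.0, 2.0, 3.0, 4.0])

sample_std = values.std()            # default ddof=1: divide by n - 1
population_std = values.std(ddof=0)  # divide by n

print(sample_std)
print(population_std)
```

NumPy's `np.std` defaults to `ddof=0`, so the two libraries can give different results for the same data unless `ddof` is set explicitly.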


In [35]:
correlation_matrix = diabetes_df.corr()
print(correlation_matrix)

          age       sex       bmi        bp        s1        s2        s3  \
age  1.000000  0.173737  0.185085  0.335428  0.260061  0.219243 -0.075181   
sex  0.173737  1.000000  0.088161  0.241010  0.035277  0.142637 -0.379090   
bmi  0.185085  0.088161  1.000000  0.395411  0.249777  0.261170 -0.366811   
bp   0.335428  0.241010  0.395411  1.000000  0.242464  0.185548 -0.178762   
s1   0.260061  0.035277  0.249777  0.242464  1.000000  0.896663  0.051519   
s2   0.219243  0.142637  0.261170  0.185548  0.896663  1.000000 -0.196455   
s3  -0.075181 -0.379090 -0.366811 -0.178762  0.051519 -0.196455  1.000000   
s4   0.203841  0.332115  0.413807  0.257650  0.542207  0.659817 -0.738493   
s5   0.270774  0.149916  0.446157  0.393480  0.515503  0.318357 -0.398577   
s6   0.301731  0.208133  0.388680  0.390430  0.325717  0.290600 -0.273697   

           s4        s5        s6  
age  0.203841  0.270774  0.301731  
sex  0.332115  0.149916  0.208133  
bmi  0.413807  0.446157  0.388680  
bp   0.2
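
The full matrix is symmetric with ones on the diagonal, so it is often enough to look up a single pair of columns. This sketch uses a hypothetical frame (`pairs`, with columns `x`, `y`, and `z`) built so that the expected correlations are obvious:

```python
import pandas as pd

# Hypothetical columns with known linear relationships
pairs = pd.DataFrame({
    'x': [1, 2, 3, 4],
    'y': [2, 4, 6, 8],  # exactly 2 * x
    'z': [4, 3, 2, 1],  # decreases as x increases
})

xy = pairs['x'].corr(pairs['y'])  # close to 1: perfect positive relationship
xz = pairs['x'].corr(pairs['z'])  # close to -1: perfect negative relationship

# The same value can be read from the full matrix
matrix = pairs.corr()
print(xy, xz)
print(matrix.loc['x', 'z'])
```

Both `Series.corr()` and `DataFrame.corr()` use the Pearson coefficient unless a different `method` is given.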

In [38]:
# remove the row labelled 1 (index= already selects rows, so axis=0 is not needed)
diabetes_df.drop(index=1, inplace=True)

In [39]:
diabetes_df.head()

| | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 |
|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 0.085299 | 0.05068 | 0.044451 | -0.00567 | -0.045599 | -0.034194 | -0.032356 | -0.002592 | 0.002861 | -0.02593 |
| 3 | -0.089063 | -0.044642 | -0.011595 | -0.036656 | 0.012191 | 0.024991 | -0.036038 | 0.034309 | 0.022688 | -0.009362 |
| 4 | 0.005383 | -0.044642 | -0.036385 | 0.021872 | 0.003935 | 0.015596 | 0.008142 | -0.002592 | -0.031988 | -0.046641 |
| 5 | -0.092695 | -0.044642 | -0.040696 | -0.019442 | -0.068991 | -0.079288 | 0.041277 | -0.076395 | -0.041176 | -0.096346 |
| 6 | -0.045472 | 0.05068 | -0.047163 | -0.015999 | -0.040096 | -0.0248 | 0.000779 | -0.039493 | -0.062917 | -0.038357 |


By default, `drop()` returns a new DataFrame and leaves the original unchanged. The row removal above passed `inplace=True`, so `diabetes_df` itself was modified. The column removal in the next cell does not, so `diabetes_df` keeps its `s6` column unless you assign the result back to `diabetes_df` or pass `inplace=True`.
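
The difference between the returned copy and the original can be checked directly. A minimal sketch on a hypothetical two-column frame (`frame` is an illustrative name):

```python
import pandas as pd

frame = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# drop() returns a new DataFrame; 'frame' itself is not touched
without_b = frame.drop(columns='b')

print(list(frame.columns))      # ['a', 'b']
print(list(without_b.columns))  # ['a']

# Assigning the result back makes the change stick
frame = frame.drop(columns='b')
print(list(frame.columns))      # ['a']
```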

In [41]:
# drop a column (returns a new DataFrame; columns= already selects columns, so axis=1 is not needed)
diabetes_df.drop(columns='s6')

| | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 |
|---|---|---|---|---|---|---|---|---|---|
| 2 | 0.085299 | 0.050680 | 0.044451 | -0.005670 | -0.045599 | -0.034194 | -0.032356 | -0.002592 | 0.002861 |
| 3 | -0.089063 | -0.044642 | -0.011595 | -0.036656 | 0.012191 | 0.024991 | -0.036038 | 0.034309 | 0.022688 |
| 4 | 0.005383 | -0.044642 | -0.036385 | 0.021872 | 0.003935 | 0.015596 | 0.008142 | -0.002592 | -0.031988 |
| 5 | -0.092695 | -0.044642 | -0.040696 | -0.019442 | -0.068991 | -0.079288 | 0.041277 | -0.076395 | -0.041176 |
| 6 | -0.045472 | 0.050680 | -0.047163 | -0.015999 | -0.040096 | -0.024800 | 0.000779 | -0.039493 | -0.062917 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 437 | 0.041708 | 0.050680 | 0.019662 | 0.059744 | -0.005697 | -0.002566 | -0.028674 | -0.002592 | 0.031193 |
| 438 | -0.005515 | 0.050680 | -0.015906 | -0.067642 | 0.049341 | 0.079165 | -0.028674 | 0.034309 | -0.018114 |
| 439 | 0.041708 | 0.050680 | -0.015906 | 0.017293 | -0.037344 | -0.013840 | -0.024993 | -0.011080 | -0.046883 |
| 440 | -0.045472 | -0.044642 | 0.039062 | 0.001215 | 0.016318 | 0.015283 | -0.028674 | 0.026560 | 0.044529 |
