# DataFrames
#### Table of Contents
- Basics
- Creating DataFrames
- Accessing Elements
- DataFrame Attributes
- Modifying DataFrames
- DataFrame Methods

## Basics
* A Pandas DataFrame is a two-dimensional data structure, like a 2D array or a table with rows and columns.
* Each observation is represented by a single row, and each parameter (variable) by a single column.
* Each column can hold a different data type.

![Pandas_2](https://github.com/abdurahimank/Pandas_Tutorial/blob/main/images/Pandas_2.png?raw=true)
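
As a minimal sketch of the points above (the `people` table and its values are made up for illustration), the following builds a small DataFrame and shows that each column carries its own data type:

```python
import pandas as pd

# Hypothetical sample data: three observations (rows), three parameters (columns)
people = pd.DataFrame({
    "name": ["Ann", "Bob", "Cy"],       # text
    "age": [31, 45, 27],                # integers
    "height_m": [1.62, 1.80, 1.75],     # floats
})

print(people.shape)   # (number of rows, number of columns)
print(people.dtypes)  # one dtype per column
```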

#### DataFrame Attributes and Methods
DataFrames provide numerous attributes and methods for data manipulation and analysis, including:
- ```shape```: Returns the dimensions (number of rows and columns) of the DataFrame.
- ```info()```: Provides a summary of the DataFrame, including data types and non-null counts.
- ```describe()```: Generates summary statistics for numerical columns.
- ```head()```, ```tail()```: Displays the first or last n rows of the DataFrame.
- ```mean()```, ```sum()```, ```min()```, ```max()```: Calculate summary statistics for columns.
- ```sort_values()```: Sort the DataFrame by one or more columns.
- ```groupby()```: Group data based on specific columns for aggregation.
- ```fillna()```, ```drop()```, ```rename()```: Fill missing values, drop rows or columns, or rename row and column labels.
- ```apply()```: Apply a function to each element, row, or column of the DataFrame.
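
A few of the methods listed above can be sketched together on a tiny table. The `sales` data below is hypothetical and only meant to show the call shapes:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value
sales = pd.DataFrame({
    "region": ["north", "south", "north", "south"],
    "units": [10, 4, np.nan, 6],
})

filled = sales.fillna({"units": 0})                    # replace missing units with 0
by_region = filled.groupby("region")["units"].sum()    # total units per region
ranked = filled.sort_values("units", ascending=False)  # largest first
doubled = filled["units"].apply(lambda u: u * 2)       # apply a function to each value

print(by_region)
print(ranked)
```

Each call returns a new object, so `sales` itself still contains the missing value.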

## Creating DataFrames

In [7]:
import pandas as pd

mydataset = {
  'cars': ["BMW", "Volvo", "Ford"],
  'passings': [3, 7, 2]
}
print(type(mydataset), mydataset)
df = pd.DataFrame(mydataset)
print(df)

<class 'dict'> {'cars': ['BMW', 'Volvo', 'Ford'], 'passings': [3, 7, 2]}
    cars  passings
0    BMW         3
1  Volvo         7
2   Ford         2


In [5]:
import pandas as pd
data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}
myvar = pd.DataFrame(data)
print(myvar)

   calories  duration
0       420        50
1       380        40
2       390        45


#### Labeling rows

In [6]:
# Custom index names
myvar.index = ["abc", "def", "ghi"]
myvar

     calories  duration
abc       420        50
def       380        40
ghi       390        45


In [1]:
import pandas as pd

data = {
  "Duration":{
    "0":60,
    "1":60,
    "2":60,
    "3":45,
    "4":45,
    "5":60
  },
  "Pulse":{
    "0":110,
    "1":117,
    "2":103,
    "3":109,
    "4":117,
    "5":102
  },
  "Maxpulse":{
    "0":130,
    "1":145,
    "2":135,
    "3":175,
    "4":148,
    "5":127
  },
  "Calories":{
    "0":409,
    "1":479,
    "2":340,
    "3":282,
    "4":406,
    "5":300
  }
}

df = pd.DataFrame(data)

print(df) 

   Duration  Pulse  Maxpulse  Calories
0        60    110       130       409
1        60    117       145       479
2        60    103       135       340
3        45    109       175       282
4        45    117       148       406
5        60    102       127       300


In [8]:
# labelling when DF is created
import numpy as np
import pandas as pd

data = {"col1":np.random.randint(5, size=5),
        "col2":np.random.randint(5, size=5),
        "col3":np.random.randint(5, size=5)}
# print(data)
df = pd.DataFrame(data, index=["row 1", "row 2", "row 3", "row 4", "row 5"])
df

       col1  col2  col3
row 1     4     2     1
row 2     4     1     0
row 3     4     0     1
row 4     1     0     0
row 5     4     2     1


In [1]:
# labelling after DF is created
import numpy as np
import pandas as pd

data = {"col1":np.random.randint(5, size=5),
        "col2":np.random.randint(5, size=5),
        "col3":np.random.randint(5, size=5)}
# print(data)
df = pd.DataFrame(data)
print(df)

df.index=['row_' + str(i) for i in range(1, 6)]
print(df)

   col1  col2  col3
0     1     4     2
1     1     4     2
2     2     4     3
3     2     4     3
4     3     3     3
       col1  col2  col3
row_1     1     4     2
row_2     1     4     2
row_3     2     4     3
row_4     2     4     3
row_5     3     3     3


In [None]:
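# Use an existing column as the index (replace "column name" with a real column)
df.set_index("column name", inplace=True)
# index_col is a parameter of pd.read_csv, not a DataFrame attribute:
# df = pd.read_csv("data.csv", index_col=0)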
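
The two usual ways of turning a column into the index can be sketched as follows. The CSV text is a made-up stand-in for a file on disk, read through `io.StringIO` so the example is self-contained:

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a real file
csv_text = "id,score\na,1\nb,2\n"

# Option 1: choose the index column while reading the data
from_reader = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Option 2: promote a column afterwards; set_index returns a new DataFrame
plain = pd.read_csv(io.StringIO(csv_text))
promoted = plain.set_index("id")

print(from_reader)
print(promoted)
```

Without `inplace=True`, `plain` keeps its default integer index and still has the `id` column.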

#### Hierarchical Indexing

In [2]:
import pandas as pd
raw_data = {'city': ['Tripoli', 'Tripoli', 'Rome', 'Rome', 'Sydney', 'Sydney'],
            'rank': ['1st', '2nd', '1st', '2nd', '1st', '2nd'],
            'name': ['Noor', 'Adam', 'Kevin', 'Raihana', 'Raj', 'Mahdi'],
            'score1': [44, 48, 30, 41, 39, 44],
            'score2': [67, 63, 55, 70, 64, 77]}

df = pd.DataFrame(raw_data)
df

Unnamed: 0,city,rank,name,score1,score2
0,Tripoli,1st,Noor,44,67
1,Tripoli,2nd,Adam,48,63
2,Rome,1st,Kevin,30,55
3,Rome,2nd,Raihana,41,70
4,Sydney,1st,Raj,39,64
5,Sydney,2nd,Mahdi,44,77


In [3]:
df.set_index(['city', 'rank'], drop=False)

                 city rank     name  score1  score2
city    rank
Tripoli 1st   Tripoli  1st     Noor      44      67
        2nd   Tripoli  2nd     Adam      48      63
Rome    1st      Rome  1st    Kevin      30      55
        2nd      Rome  2nd  Raihana      41      70
Sydney  1st    Sydney  1st      Raj      39      64
        2nd    Sydney  2nd    Mahdi      44      77


In [6]:
df.set_index(['city', 'rank'], drop=True)

                 name  score1  score2
city    rank
Tripoli 1st      Noor      44      67
        2nd      Adam      48      63
Rome    1st     Kevin      30      55
        2nd   Raihana      41      70
Sydney  1st       Raj      39      64
        2nd     Mahdi      44      77


In [7]:
# set_index() returned new DataFrames above, so df itself still has its default index
df.index

RangeIndex(start=0, stop=6, step=1)
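
Once the result of `set_index` is kept, rows can be selected by one or both index levels. This is a minimal sketch on a shortened, reordered copy of the data above; the variable names `scores` and `indexed` are illustrative:

```python
import pandas as pd

scores = pd.DataFrame({
    "city": ["Rome", "Rome", "Tripoli", "Tripoli"],
    "rank": ["1st", "2nd", "1st", "2nd"],
    "score1": [30, 41, 44, 48],
})

# set_index returns a new DataFrame, so store the result
indexed = scores.set_index(["city", "rank"])

rome = indexed.loc["Rome"]                           # all rows under the outer label
rome_first = indexed.loc[("Rome", "1st"), "score1"]  # a single cell via a tuple key

print(rome)
print(rome_first)
```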

#### Changing column names

In [12]:
import numpy as np
import pandas as pd

data = {"col1":np.random.randint(5, size=5),
        "col2":np.random.randint(5, size=5),
        "col3":np.random.randint(5, size=5)}
# print(data)
df = pd.DataFrame(data, index=["row 1", "row 2", "row 3", "row 4", "row 5"])
print(df)

df.columns = ["column 1", "column 2", "column 3"]
df

       col1  col2  col3
row 1     1     1     1
row 2     3     4     3
row 3     1     2     2
row 4     0     1     0
row 5     0     3     3


Unnamed: 0,column 1,column 2,column 3
row 1,1,1,1
row 2,3,4,3
row 3,1,2,2
row 4,0,1,0
row 5,0,3,3


In [None]:
# With the rename method; it returns a new DataFrame, so assign the result to keep it
df = df.rename(columns={"oldname": "newname"})

In [None]:
# If the data has no column names (headers), supply them with the columns argument when creating the DataFrame.
# (names= is the equivalent argument of pd.read_csv; pd.DataFrame does not accept it.)
df = pd.DataFrame(np.random.randint(5, size=(5, 3)), index=["row 1", "row 2", "row 3", "row 4", "row 5"], columns=["col 1", "col 2", "col 3"])
print(df)
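
For data without a header row, the labels are supplied differently depending on where the data comes from. The sketch below uses made-up values: `pd.read_csv` takes `header=None` together with `names`, while the `pd.DataFrame` constructor takes `columns`:

```python
import io
import pandas as pd

# Hypothetical headerless CSV content
raw = "1,2,3\n4,5,6\n"

# header=None tells pandas the first line is data; names supplies the labels
labeled = pd.read_csv(io.StringIO(raw), header=None, names=["col 1", "col 2", "col 3"])

# For in-memory data, the DataFrame constructor uses columns= instead
same = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["col 1", "col 2", "col 3"])

print(labeled)
print(same)
```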

## Accessing Elements
- Pandas provides the ```.loc``` (label-based) and ```.iloc``` (position-based) indexers for selecting **rows**.
- Plain square brackets (```[]```) can also be used, mainly for selecting **columns**.

In [13]:
import pandas as pd
df = pd.DataFrame({
    'Population': [35.467, 63.951, 80.94 , 60.665, 127.061, 64.511, 318.523],
    'GDP': [
        1785387,
        2833687,
        3874437,
        2167744,
        4602367,
        2950039,
        17348075
    ],
    'Surface Area': [
        9984670,
        640679,
        357114,
        301336,
        377930,
        242495,
        9525067
    ],
    'HDI': [
        0.913,
        0.888,
        0.916,
        0.873,
        0.891,
        0.907,
        0.915
    ],
    'Continent': [
        'America',
        'Europe',
        'Europe',
        'Europe',
        'Asia',
        'Europe',
        'America'
    ]
}, columns=['Population', 'GDP', 'Surface Area', 'HDI', 'Continent'])

In [14]:
df

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent
0,35.467,1785387,9984670,0.913,America
1,63.951,2833687,640679,0.888,Europe
2,80.94,3874437,357114,0.916,Europe
3,60.665,2167744,301336,0.873,Europe
4,127.061,4602367,377930,0.891,Asia
5,64.511,2950039,242495,0.907,Europe
6,318.523,17348075,9525067,0.915,America


In [15]:
df.index = [
    'Canada',
    'France',
    'Germany',
    'Italy',
    'Japan',
    'United Kingdom',
    'United States',
]

In [16]:
df

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent
Canada,35.467,1785387,9984670,0.913,America
France,63.951,2833687,640679,0.888,Europe
Germany,80.94,3874437,357114,0.916,Europe
Italy,60.665,2167744,301336,0.873,Europe
Japan,127.061,4602367,377930,0.891,Asia
United Kingdom,64.511,2950039,242495,0.907,Europe
United States,318.523,17348075,9525067,0.915,America


#### By index
```df[start:stop]``` (a slice inside plain square brackets selects rows by position)

```df.iloc[row, column]```

In [23]:
df[:1]

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent
Canada,35.467,1785387,9984670,0.913,America


In [17]:
df[0:2]

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent
Canada,35.467,1785387,9984670,0.913,America
France,63.951,2833687,640679,0.888,Europe


In [31]:
print(df.iloc[0])

Population       35.467
GDP             1785387
Surface Area    9984670
HDI               0.913
Continent       America
Name: Canada, dtype: object


In [32]:
print(df.iloc[[0, 2]])

         Population      GDP  Surface Area    HDI Continent
Canada       35.467  1785387       9984670  0.913   America
Germany      80.940  3874437        357114  0.916    Europe


In [34]:
print(df.iloc[2:5])

         Population      GDP  Surface Area    HDI Continent
Germany      80.940  3874437        357114  0.916    Europe
Italy        60.665  2167744        301336  0.873    Europe
Japan       127.061  4602367        377930  0.891      Asia


#### By index name
```df.loc[row, column]```

In [35]:
print(df.loc['Canada'])

Population       35.467
GDP             1785387
Surface Area    9984670
HDI               0.913
Continent       America
Name: Canada, dtype: object


In [36]:
print(df.loc[['Canada', 'Italy']])

        Population      GDP  Surface Area    HDI Continent
Canada      35.467  1785387       9984670  0.913   America
Italy       60.665  2167744        301336  0.873    Europe


In [38]:
print(df.loc['Canada':'Italy'])

         Population      GDP  Surface Area    HDI Continent
Canada       35.467  1785387       9984670  0.913   America
France       63.951  2833687        640679  0.888    Europe
Germany      80.940  3874437        357114  0.916    Europe
Italy        60.665  2167744        301336  0.873    Europe
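
One detail worth noting when comparing the two indexers: a label slice with `.loc` includes its end label, while a positional slice with `.iloc` excludes its end position. A minimal sketch using a few of the GDP values from this section (the name `gdp` is illustrative):

```python
import pandas as pd

gdp = pd.DataFrame(
    {"GDP": [1785387, 2833687, 3874437, 2167744]},
    index=["Canada", "France", "Germany", "Italy"],
)

by_label = gdp.loc["Canada":"Germany"]  # end label 'Germany' is included
by_position = gdp.iloc[0:2]             # end position 2 is excluded

print(len(by_label))
print(len(by_position))
```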


### Accessing Columns

In [39]:
df["GDP"]

Canada             1785387
France             2833687
Germany            3874437
Italy              2167744
Japan              4602367
United Kingdom     2950039
United States     17348075
Name: GDP, dtype: int64

In [24]:
df.HDI

Canada            0.913
France            0.888
Germany           0.916
Italy             0.873
Japan             0.891
United Kingdom    0.907
United States     0.915
Name: HDI, dtype: float64

In [42]:
df[["GDP", "HDI"]]

Unnamed: 0,GDP,HDI
Canada,1785387,0.913
France,2833687,0.888
Germany,3874437,0.916
Italy,2167744,0.873
Japan,4602367,0.891
United Kingdom,2950039,0.907
United States,17348075,0.915


### Accessing both rows and columns

In [44]:
df.iloc[1:4, 2:4]

Unnamed: 0,Surface Area,HDI
France,640679,0.888
Germany,357114,0.916
Italy,301336,0.873


In [43]:
df.loc["Canada":"Italy", ["GDP", "HDI"]]

Unnamed: 0,GDP,HDI
Canada,1785387,0.913
France,2833687,0.888
Germany,3874437,0.916
Italy,2167744,0.873


### Conditional Selection

In [46]:
df.loc[df["Population"] > 70]

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent
Germany,80.94,3874437,357114,0.916,Europe
Japan,127.061,4602367,377930,0.891,Asia
United States,318.523,17348075,9525067,0.915,America


In [47]:
df.loc[df["Population"] > 70, ["GDP", "HDI"]]

Unnamed: 0,GDP,HDI
Germany,3874437,0.916
Japan,4602367,0.891
United States,17348075,0.915
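
Conditions can also be combined. The sketch below, on a reduced copy of the table (named `df_small` here for illustration), uses `&` for "and" and `~` for "not"; each comparison needs its own parentheses because of operator precedence:

```python
import pandas as pd

df_small = pd.DataFrame(
    {
        "Population": [35.467, 63.951, 80.94, 127.061],
        "Continent": ["America", "Europe", "Europe", "Asia"],
    },
    index=["Canada", "France", "Germany", "Japan"],
)

# Rows that satisfy both conditions
big_europe = df_small.loc[
    (df_small["Population"] > 70) & (df_small["Continent"] == "Europe")
]

# Rows whose continent is NOT in the given list
not_europe = df_small.loc[~df_small["Continent"].isin(["Europe"])]

print(big_europe)
print(not_europe)
```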


## DataFrame Attributes
* The **head()** method returns the column headers and a specified number of rows, starting from the top.
* If the number of rows is not specified, head() returns the first 5 rows.

In [2]:
import pandas as pd
df = pd.read_csv('data.csv')

print(df.head(10))

   Duration  Pulse  Maxpulse  Calories
0        60    110       130     409.1
1        60    117       145     479.0
2        60    103       135     340.0
3        45    109       175     282.4
4        45    117       148     406.0
5        60    102       127     300.0
6        60    110       136     374.0
7        45    104       134     253.3
8        30    109       133     195.1
9        60     98       124     269.0


* The **tail()** method returns the headers and a specified number of rows, starting from the bottom.

In [3]:
print(df.tail())

     Duration  Pulse  Maxpulse  Calories
164        60    105       140     290.8
165        60    110       145     300.0
166        60    115       145     310.2
167        75    120       150     320.4
168        75    125       150     330.4


In [4]:
# info() prints a summary of the DataFrame and returns None, which is why wrapping it in print() adds a final "None" line.
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 169 entries, 0 to 168
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Duration  169 non-null    int64  
 1   Pulse     169 non-null    int64  
 2   Maxpulse  169 non-null    int64  
 3   Calories  164 non-null    float64
dtypes: float64(1), int64(3)
memory usage: 5.4 KB
None


In [8]:
# columns
df.columns

Index(['Duration', 'Pulse', 'Maxpulse', 'Calories'], dtype='object')

In [9]:
# index
df.index

RangeIndex(start=0, stop=169, step=1)

In [10]:
df.size

676

In [11]:
df.shape

(169, 4)
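
The relationship between `shape` and `size` can be checked on a small, made-up frame: `size` is the total number of cells, that is, rows times columns.

```python
import pandas as pd

frame = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

rows, cols = frame.shape   # shape is a (rows, columns) tuple
print(rows, cols)
print(frame.size)          # total number of cells
print(len(frame))          # len() gives the number of rows
```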

In [14]:
df.describe()

Unnamed: 0,Duration,Pulse,Maxpulse,Calories
count,169.0,169.0,169.0,164.0
mean,63.846154,107.461538,134.047337,375.790244
std,42.299949,14.510259,16.450434,266.379919
min,15.0,80.0,100.0,50.3
25%,45.0,100.0,124.0,250.925
50%,60.0,105.0,131.0,318.6
75%,60.0,111.0,141.0,387.6
max,300.0,159.0,184.0,1860.4


In [13]:
df.dtypes

Duration      int64
Pulse         int64
Maxpulse      int64
Calories    float64
dtype: object

## Modifying DataFrames
In most cases, modifying functions return a new DataFrame and leave the existing one unchanged. To keep the change, assign the result back to the same variable.
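
A minimal sketch of this behaviour with `drop` on a throwaway frame (`df_demo` is a hypothetical name):

```python
import pandas as pd

df_demo = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

dropped = df_demo.drop(columns="b")   # new DataFrame; df_demo is unchanged
still_has_b = "b" in df_demo.columns  # True

df_demo = df_demo.drop(columns="b")   # reassign to keep the change

print(still_has_b)
print(df_demo.columns.tolist())
```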

In [48]:
import pandas as pd
df = pd.DataFrame({
    'Population': [35.467, 63.951, 80.94 , 60.665, 127.061, 64.511, 318.523],
    'GDP': [
        1785387,
        2833687,
        3874437,
        2167744,
        4602367,
        2950039,
        17348075
    ],
    'Surface Area': [
        9984670,
        640679,
        357114,
        301336,
        377930,
        242495,
        9525067
    ],
    'HDI': [
        0.913,
        0.888,
        0.916,
        0.873,
        0.891,
        0.907,
        0.915
    ],
    'Continent': [
        'America',
        'Europe',
        'Europe',
        'Europe',
        'Asia',
        'Europe',
        'America'
    ]
}, columns=['Population', 'GDP', 'Surface Area', 'HDI', 'Continent'], 
   index = ['Canada', 'France', 'Germany', 'Italy', 'Japan', 'United Kingdom', 'United States'])
df

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent
Canada,35.467,1785387,9984670,0.913,America
France,63.951,2833687,640679,0.888,Europe
Germany,80.94,3874437,357114,0.916,Europe
Italy,60.665,2167744,301336,0.873,Europe
Japan,127.061,4602367,377930,0.891,Asia
United Kingdom,64.511,2950039,242495,0.907,Europe
United States,318.523,17348075,9525067,0.915,America


#### Dropping values

In [49]:
df.drop("Canada")

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent
France,63.951,2833687,640679,0.888,Europe
Germany,80.94,3874437,357114,0.916,Europe
Italy,60.665,2167744,301336,0.873,Europe
Japan,127.061,4602367,377930,0.891,Asia
United Kingdom,64.511,2950039,242495,0.907,Europe
United States,318.523,17348075,9525067,0.915,America


In [50]:
df.drop(["Canada", "Japan"])

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent
France,63.951,2833687,640679,0.888,Europe
Germany,80.94,3874437,357114,0.916,Europe
Italy,60.665,2167744,301336,0.873,Europe
United Kingdom,64.511,2950039,242495,0.907,Europe
United States,318.523,17348075,9525067,0.915,America


In [51]:
df.drop(columns=["GDP", "HDI"])

Unnamed: 0,Population,Surface Area,Continent
Canada,35.467,9984670,America
France,63.951,640679,Europe
Germany,80.94,357114,Europe
Italy,60.665,301336,Europe
Japan,127.061,377930,Asia
United Kingdom,64.511,242495,Europe
United States,318.523,9525067,America


In [52]:
df.drop(["Italy", "Japan"], axis=0)

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent
Canada,35.467,1785387,9984670,0.913,America
France,63.951,2833687,640679,0.888,Europe
Germany,80.94,3874437,357114,0.916,Europe
United Kingdom,64.511,2950039,242495,0.907,Europe
United States,318.523,17348075,9525067,0.915,America


In [53]:
df.drop(["GDP", "HDI"], axis=1)

Unnamed: 0,Population,Surface Area,Continent
Canada,35.467,9984670,America
France,63.951,640679,Europe
Germany,80.94,357114,Europe
Italy,60.665,301336,Europe
Japan,127.061,377930,Asia
United Kingdom,64.511,242495,Europe
United States,318.523,9525067,America


#### Adding a Column

In [54]:
langs = pd.Series(
    ['French', 'German', 'Italian'],
    index=['France', 'Germany', 'Italy'],
    name='Language'
)
langs

France      French
Germany     German
Italy      Italian
Name: Language, dtype: object

In [55]:
df['Language'] = langs
df

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent,Language
Canada,35.467,1785387,9984670,0.913,America,
France,63.951,2833687,640679,0.888,Europe,French
Germany,80.94,3874437,357114,0.916,Europe,German
Italy,60.665,2167744,301336,0.873,Europe,Italian
Japan,127.061,4602367,377930,0.891,Asia,
United Kingdom,64.511,2950039,242495,0.907,Europe,
United States,318.523,17348075,9525067,0.915,America,
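
The empty `Language` cells are missing values (NaN). When a Series is assigned as a column, pandas aligns it with the DataFrame by index label, and rows without a matching label get NaN. A reduced sketch, with `countries` as an illustrative name:

```python
import pandas as pd

countries = pd.DataFrame({"GDP": [1785387, 2833687]}, index=["Canada", "France"])
langs = pd.Series(["French"], index=["France"], name="Language")

# Assignment aligns on index labels, not on position
countries["Language"] = langs

print(countries)
print(countries["Language"].isna())
```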


#### Replacing values in column

In [56]:
df["Language"] = "English"
df

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent,Language
Canada,35.467,1785387,9984670,0.913,America,English
France,63.951,2833687,640679,0.888,Europe,English
Germany,80.94,3874437,357114,0.916,Europe,English
Italy,60.665,2167744,301336,0.873,Europe,English
Japan,127.061,4602367,377930,0.891,Asia,English
United Kingdom,64.511,2950039,242495,0.907,Europe,English
United States,318.523,17348075,9525067,0.915,America,English


#### Renaming column and row names

In [58]:
df.rename(
    columns={
        'HDI': 'Human Development Index',
        'Anual Popcorn Consumption': 'APC'
    }, index={
        'United States': 'USA',
        'United Kingdom': 'UK',
        'Argentina': 'AR'
    })

Unnamed: 0,Population,GDP,Surface Area,Human Development Index,Continent,Language
Canada,35.467,1785387,9984670,0.913,America,English
France,63.951,2833687,640679,0.888,Europe,English
Germany,80.94,3874437,357114,0.916,Europe,English
Italy,60.665,2167744,301336,0.873,Europe,English
Japan,127.061,4602367,377930,0.891,Asia,English
UK,64.511,2950039,242495,0.907,Europe,English
USA,318.523,17348075,9525067,0.915,America,English
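
Note that the mapping above contains keys that do not exist in the DataFrame; `rename` ignores them by default. The sketch below, on a one-row frame `t` made up for illustration, shows that behaviour and the `errors="raise"` option, which turns unknown keys into a `KeyError`:

```python
import pandas as pd

t = pd.DataFrame({"HDI": [0.913]}, index=["Canada"])

# Unknown keys ("Missing") are ignored by default
renamed = t.rename(columns={"HDI": "Human Development Index", "Missing": "M"})

# With errors="raise", an unknown key raises KeyError instead
try:
    t.rename(columns={"Missing": "M"}, errors="raise")
    raised = False
except KeyError:
    raised = True

print(renamed.columns.tolist())
print(raised)
```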


In [59]:
# Only the column assignments (df["Language"] = ...) modified the original DataFrame; drop() and rename() returned new ones.
df

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent,Language
Canada,35.467,1785387,9984670,0.913,America,English
France,63.951,2833687,640679,0.888,Europe,English
Germany,80.94,3874437,357114,0.916,Europe,English
Italy,60.665,2167744,301336,0.873,Europe,English
Japan,127.061,4602367,377930,0.891,Asia,English
United Kingdom,64.511,2950039,242495,0.907,Europe,English
United States,318.523,17348075,9525067,0.915,America,English


## DataFrame Methods

In [25]:
import pandas as pd
import numpy as np
df = pd.DataFrame({'temp':pd.Series(28 + 10*np.random.randn(10)), 
                'rain':pd.Series(100 + 50*np.random.randn(10)),
             'location':list('AAAAABBBBB')})
df

Unnamed: 0,temp,rain,location
0,22.330169,111.984767,A
1,41.414524,39.839171,A
2,24.125791,163.888518,A
3,23.024062,154.364226,A
4,32.596523,57.266291,A
5,27.848918,165.580745,B
6,36.008795,124.497195,B
7,6.023301,56.115259,B
8,27.902979,37.60801,B
9,30.190074,154.28878,B


#### df.info()

In [26]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   temp      10 non-null     float64
 1   rain      10 non-null     float64
 2   location  10 non-null     object 
dtypes: float64(2), object(1)
memory usage: 368.0+ bytes


#### df.describe()
- By default, the describe method summarizes only the numeric columns.

In [27]:
df.describe()

Unnamed: 0,temp,rain
count,10.0,10.0
mean,27.146514,106.543296
std,9.531552,53.60871
min,6.023301,37.60801
25%,23.299494,56.403017
50%,27.875949,118.240981
75%,31.994911,154.345365
max,41.414524,165.580745


You can use the ```include``` argument to whitelist the data types to include in the result.

In [28]:
print(df.describe(include=['object']))

       location
count        10
unique        2
top           A
freq          5
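
To summarize every column in one call, `include="all"` can be passed instead of a list of types. A small sketch on made-up data (`weather` is a hypothetical name); statistics that do not apply to a column are reported as NaN:

```python
import pandas as pd

weather = pd.DataFrame({
    "temp": [22.0, 30.0, 26.0],
    "location": ["A", "A", "B"],
})

numeric_only = weather.describe()             # default: numeric columns only
everything = weather.describe(include="all")  # numeric and non-numeric columns

print(numeric_only)
print(everything)
```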
