# Basics and Data Structures
#### Table of Contents
1. Pandas Basics
2. Series
3. DataFrame

# Pandas Basics
Pandas is a popular open-source data manipulation and analysis library for the Python programming language. It provides a powerful and flexible set of tools for working with structured data, making it a fundamental tool for data scientists, analysts, and engineers.

Pandas is designed to handle data in various formats, such as tabular data, time series data, and more, making it an essential part of the data processing workflow in many industries.

Here are some key features and functionalities of Pandas:

### Data Structures: 
Pandas offers two primary data structures:
- DataFrame
- Series

Older releases also had a third, 3-D structure called Panel, which was removed in pandas 0.25.

A **DataFrame** is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns.

- A Pandas DataFrame is often created by loading a dataset from existing storage.
- Storage can be an SQL database, a CSV file, an Excel file, etc.
- It can also be created from lists, dictionaries, or a list of dictionaries.
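
The in-memory options above can be sketched as follows. This is a minimal illustration; the names `from_dict`, `from_records`, and `from_lists` and the sample values are made up for the example.

```python
import pandas as pd

# From a dictionary of lists: each key becomes a column.
from_dict = pd.DataFrame({"name": ["Ann", "Bob"], "age": [31, 45]})

# From a list of dictionaries: each dictionary becomes a row.
# A key missing from a row ("age" for Cy) is filled with NaN.
from_records = pd.DataFrame([
    {"name": "Ann", "age": 31},
    {"name": "Cy"},
])

# From a list of lists, with the column names supplied separately.
from_lists = pd.DataFrame([["Ann", 31], ["Bob", 45]], columns=["name", "age"])

print(from_dict)
print(from_records)
print(from_lists)
```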

**Series** represents a one-dimensional array of indexed data. It has two main components:

1. An array of actual data.
2. An associated array of indexes or data labels.

The index is used to access individual data values. You can also get a column of a dataframe as a **Series**. You can think of a Pandas series as a 1-D dataframe.
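
A small sketch of the relationship described above, assuming a made-up two-column DataFrame (`city` and `temp_c` are illustrative names): selecting a single column returns a Series that carries both the data and the labels.

```python
import pandas as pd

df = pd.DataFrame({"city": ["Oslo", "Lima"], "temp_c": [4.5, 19.0]})

col = df["temp_c"]          # selecting one column returns a Series
print(type(col).__name__)   # Series
print(col.values)           # the underlying data as a NumPy array
print(col.index)            # the labels, shared with the DataFrame
print(col.name)             # the Series keeps the column name
```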

### Data Import and Export: 
Pandas makes it easy to read data from various sources, including CSV files, Excel spreadsheets, SQL databases, and more. It can also export data to these formats, enabling seamless data exchange.
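
As a rough illustration of a round trip, the sketch below exports a small DataFrame to CSV text and reads it back. It uses an in-memory `io.StringIO` buffer in place of a real file so that it runs anywhere; the data is invented for the example.

```python
import io
import pandas as pd

df = pd.DataFrame({"day": ["day1", "day2"], "calories": [420, 380]})

# Export: with no path argument, to_csv returns the CSV text as a string.
csv_text = df.to_csv(index=False)
print(csv_text)

# Import: read_csv accepts a file path or any file-like object.
restored = pd.read_csv(io.StringIO(csv_text))
print(restored)
```

With a real file, the same calls would take a path instead, for example `df.to_csv("out.csv", index=False)` and `pd.read_csv("out.csv")`, where `out.csv` is a placeholder name.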

### Data Merging and Joining: 
You can combine multiple DataFrames using methods like merge and join, similar to SQL operations, to create more complex datasets from different sources.
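
A minimal sketch of both methods, using two invented tables (`people` and `scores`) that share an `id` column:

```python
import pandas as pd

people = pd.DataFrame({"id": [1, 2, 3], "name": ["Tom", "Kris", "Ahmad"]})
scores = pd.DataFrame({"id": [2, 3, 4], "score": [88, 92, 75]})

# merge matches rows on a shared column, like a SQL JOIN.
inner = pd.merge(people, scores, on="id", how="inner")  # only ids found in both
left = pd.merge(people, scores, on="id", how="left")    # all people; missing score is NaN

# join matches on the index instead of a column.
joined = people.set_index("id").join(scores.set_index("id"), how="outer")

print(inner)
print(left)
print(joined)
```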

### Efficient Indexing: 
Pandas provides efficient indexing and selection methods, allowing you to access specific rows and columns of data quickly.

### Custom Data Structures: 
You can create custom data structures and manipulate data in ways that suit your specific needs, extending Pandas' capabilities.

In [1]:
import pandas as pd
import numpy as np

In [5]:
print(pd.__version__)

1.4.2


# Series
* A Pandas Series is like a column in a table.
* Series is a 1-D array, holding data values of a single variable, captured from multiple observations.
![Pandas_1](https://github.com/abdurahimank/Pandas_Tutorial/blob/main/images/Pandas_1.png?raw=true)

### Series Attributes and Methods
Pandas Series come with various attributes and methods to help you manipulate and analyze data effectively. <br>Here are a few essential ones:

- **values**: Returns the Series data as a NumPy array.
- **index**: Returns the index (labels) of the Series.
- **shape**: Returns a tuple representing the dimensions of the Series.
- **size**: Returns the number of elements in the Series.
- **mean()**, **sum()**, **min()**, **max()**: Calculate summary statistics of the data.
- **unique()**, **nunique()**: Get unique values or the number of unique values.
- **sort_values()**, **sort_index()**: Sort the Series by values or index labels.
- **isnull()**, **notnull()**: Check for missing (NaN) or non-missing values.
- **apply()**: Apply a custom function to each element of the Series.
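
A quick tour of several of these on a small, made-up Series with one missing value (a sketch, not an exhaustive reference):

```python
import pandas as pd

s = pd.Series([3.0, 1.0, None, 3.0], index=["a", "b", "c", "d"])

print(s.shape, s.size)           # (4,) 4
print(s.isnull().sum())          # number of missing values
print(s.sum(), s.mean())         # NaN is skipped by default
print(s.unique())                # distinct values, NaN included
print(s.nunique())               # count of distinct values, NaN excluded
print(s.sort_values())           # ascending, NaN placed last
print(s.apply(lambda x: x * 2))  # element-wise custom function
```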

**Creating Series from Lists and Tuples**

In [2]:
a = pd.Series([35.46, 78.89, 34.23, 97.12, 15.78])
a

0    35.46
1    78.89
2    34.23
3    97.12
4    15.78
dtype: float64

**Key/Value Objects as Series**
* You can also use a key/value object, like a dictionary, when creating a Series.
* The keys of the dictionary become the labels.

In [13]:
import pandas as pd
calories = {"day1": 420, "day2": 380, "day3": 390}
myvar = pd.Series(calories)
print(myvar)

day1    420
day2    380
day3    390
dtype: int64


In [12]:
# Selecting specific objects
import pandas as pd
calories = {"day1": 420, "day2": 380, "day3": 390}
myvar = pd.Series(calories, index = ["day1", "day2"])
print(myvar)

day1    420
day2    380
dtype: int64


### Labels
* If nothing else is specified, the values are labeled with their position: the first value has index 0, the second has index 1, and so on.
* This label can be used to access a specified value.

In [9]:
a = pd.Series([35.46, 78.89, 34.23, 97.12, 15.78])
print(a)
print(a[0])

0    35.46
1    78.89
2    34.23
3    97.12
4    15.78
dtype: float64
35.46


* **With the `index` argument, or by assigning to the `index` attribute as in the next cell, you can name your own labels**.
* When you have created labels, you can access an item by referring to the label.

In [10]:
# Giving manual index values
a.index = ["Brasil", 
          "Russia", 
          "India",
          "China",
          "SA"]
print(a)
print(a["Russia"])

Brasil    35.46
Russia    78.89
India     34.23
China     97.12
SA        15.78
dtype: float64
78.89


In [3]:
a.name = "G7 Population in millions"
a

0    35.46
1    78.89
2    34.23
3    97.12
4    15.78
Name: G7 Population in millions, dtype: float64

In [5]:
a.dtype

dtype('float64')

In [6]:
a.values

array([35.46, 78.89, 34.23, 97.12, 15.78])

In [7]:
type(a.values)

numpy.ndarray

In [8]:
a.name

'G7 Population in millions'

In [9]:
a.index

RangeIndex(start=0, stop=5, step=1)

Brasil    35.46
Russia    78.89
India     34.23
China     97.12
SA        15.78
Name: G7 Population in millions, dtype: float64

In [3]:
import pandas as pd
certificates_earned = pd.Series([8, 2, 5, 6], index=['Tom', 'Kris', 'Ahmad', 'Beau'])
certificates_earned

Tom      8
Kris     2
Ahmad    5
Beau     6
dtype: int64

### Accessing Elements

In [13]:
a = pd.Series([35.46, 78.89, 34.23, 97.12, 15.78], 
             index = ["Brasil", "Russia", "India", "China", "SA"], name = "Brics Nations GDP")
a

Brasil    35.46
Russia    78.89
India     34.23
China     97.12
SA        15.78
Name: Brics Nations GDP, dtype: float64

In [16]:
# accessing by position (on a labeled Series, plain a[2] is deprecated in newer pandas releases; prefer a.iloc[2])
print(a[2])

34.23


In [None]:
# accessing by index name
print(a["China"])

In [17]:
# with "iloc" attribute
print(a.iloc[4])

15.78


In [18]:
# multiple elements with index name
print(a[["India", "China"]])

India    34.23
China    97.12
Name: Brics Nations GDP, dtype: float64


In [19]:
# multiple elements with "iloc"
print(a.iloc[[0, 4]])

Brasil    35.46
SA        15.78
Name: Brics Nations GDP, dtype: float64


#### Conditional Selection

In [21]:
a = pd.Series([35.46, 78.89, 34.23, 97.12, 15.78], 
             index = ["Brasil", "Russia", "India", "China", "SA"], name = "Brics Nations GDP")
a

Brasil    35.46
Russia    78.89
India     34.23
China     97.12
SA        15.78
Name: Brics Nations GDP, dtype: float64

In [23]:
a[a > 50]

Russia    78.89
China     97.12
Name: Brics Nations GDP, dtype: float64

### Modifying Series

In [24]:
a = pd.Series([35.46, 78.89, 34.23, 97.12, 15.78], 
             index = ["Brasil", "Russia", "India", "China", "SA"], name = "Brics Nations GDP")
a

Brasil    35.46
Russia    78.89
India     34.23
China     97.12
SA        15.78
Name: Brics Nations GDP, dtype: float64

In [25]:
a["Brasil"] = 50
a

Brasil    50.00
Russia    78.89
India     34.23
China     97.12
SA        15.78
Name: Brics Nations GDP, dtype: float64

In [26]:
a[a < 50] = 50
a

Brasil    50.00
Russia    78.89
India     50.00
China     97.12
SA        50.00
Name: Brics Nations GDP, dtype: float64

#### Differences between loc and iloc
loc is label-based, which means that you have to specify rows and columns based on their row and column labels. iloc is integer position-based, so you have to specify rows and columns by their integer position values (0-based integer position).
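
The difference is easiest to see when integer labels do not match the positions. The sketch below uses an invented Series whose labels are 10, 20, 30:

```python
import pandas as pd

# Integer labels that deliberately differ from the positions 0, 1, 2.
s = pd.Series(["x", "y", "z"], index=[10, 20, 30])

print(s.loc[20])    # label 20 -> y
print(s.iloc[0])    # position 0 -> x

# Slices differ too: loc includes the end label, iloc excludes the end position.
print(s.loc[10:20].tolist())   # ['x', 'y']
print(s.iloc[0:1].tolist())    # ['x']
```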

# DataFrames
* A Pandas DataFrame is a two-dimensional data structure, like a two-dimensional array or a table with rows and columns.
* Each observation is represented by a single row, and each parameter by a single column.
* Each column can hold a different data type.
![Pandas_2](https://github.com/abdurahimank/Pandas_Tutorial/blob/main/images/Pandas_2.png?raw=true)

### DataFrame Attributes and Methods
DataFrames provide numerous attributes and methods for data manipulation and analysis, including:
- ```shape```: Returns the dimensions (number of rows and columns) of the DataFrame.
- ```info()```: Provides a summary of the DataFrame, including data types and non-null counts.
- ```describe()```: Generates summary statistics for numerical columns.
- ```head()```, ```tail()```: Displays the first or last n rows of the DataFrame.
- ```mean()```, ```sum()```, ```min()```, ```max()```: Calculate summary statistics for columns.
- ```sort_values()```: Sort the DataFrame by one or more columns.
- ```groupby()```: Group data based on specific columns for aggregation.
- ```fillna()```, ```drop()```, ```rename()```: Handle missing values, drop columns, or rename columns.
- ```apply()```: Apply a function to each element, row, or column of the DataFrame.
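
A short sketch that chains a few of these methods on a small, invented table (`team` and `points` are illustrative column names):

```python
import pandas as pd

df = pd.DataFrame({
    "team": ["red", "blue", "red", "blue"],
    "points": [10, None, 30, 20],
})

print(df.shape)                      # (rows, columns)
filled = df.fillna({"points": 0})    # replace missing points with 0
print(filled.sort_values("points", ascending=False))
print(filled.groupby("team")["points"].sum())   # total points per team
print(filled.rename(columns={"points": "score"}).head(2))
```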

In [7]:
import pandas as pd

mydataset = {
  'cars': ["BMW", "Volvo", "Ford"],
  'passings': [3, 7, 2]
}
print(type(mydataset), mydataset)
df = pd.DataFrame(mydataset)
print(df)

<class 'dict'> {'cars': ['BMW', 'Volvo', 'Ford'], 'passings': [3, 7, 2]}
    cars  passings
0    BMW         3
1  Volvo         7
2   Ford         2


In [5]:
import pandas as pd
data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}
myvar = pd.DataFrame(data)
print(myvar)

   calories  duration
0       420        50
1       380        40
2       390        45


In [6]:
# Custom index names
myvar.index = ["abc", "def", "ghi"]
myvar

     calories  duration
abc       420        50
def       380        40
ghi       390        45


In [1]:
import pandas as pd

data = {
  "Duration":{
    "0":60,
    "1":60,
    "2":60,
    "3":45,
    "4":45,
    "5":60
  },
  "Pulse":{
    "0":110,
    "1":117,
    "2":103,
    "3":109,
    "4":117,
    "5":102
  },
  "Maxpulse":{
    "0":130,
    "1":145,
    "2":135,
    "3":175,
    "4":148,
    "5":127
  },
  "Calories":{
    "0":409,
    "1":479,
    "2":340,
    "3":282,
    "4":406,
    "5":300
  }
}

df = pd.DataFrame(data)

print(df) 

   Duration  Pulse  Maxpulse  Calories
0        60    110       130       409
1        60    117       145       479
2        60    103       135       340
3        45    109       175       282
4        45    117       148       406
5        60    102       127       300


### Accessing Elements

In [27]:
import pandas as pd
df = pd.DataFrame({
    'Population': [35.467, 63.951, 80.94 , 60.665, 127.061, 64.511, 318.523],
    'GDP': [
        1785387,
        2833687,
        3874437,
        2167744,
        4602367,
        2950039,
        17348075
    ],
    'Surface Area': [
        9984670,
        640679,
        357114,
        301336,
        377930,
        242495,
        9525067
    ],
    'HDI': [
        0.913,
        0.888,
        0.916,
        0.873,
        0.891,
        0.907,
        0.915
    ],
    'Continent': [
        'America',
        'Europe',
        'Europe',
        'Europe',
        'Asia',
        'Europe',
        'America'
    ]
}, columns=['Population', 'GDP', 'Surface Area', 'HDI', 'Continent'])

In [26]:
df

   Population       GDP  Surface Area    HDI Continent
0      35.467   1785387       9984670  0.913   America
1      63.951   2833687        640679  0.888    Europe
2      80.940   3874437        357114  0.916    Europe
3      60.665   2167744        301336  0.873    Europe
4     127.061   4602367        377930  0.891      Asia
5      64.511   2950039        242495  0.907    Europe
6     318.523  17348075       9525067  0.915   America


In [28]:
df.index = [
    'Canada',
    'France',
    'Germany',
    'Italy',
    'Japan',
    'United Kingdom',
    'United States',
]

In [29]:
df

                Population       GDP  Surface Area    HDI Continent
Canada              35.467   1785387       9984670  0.913   America
France              63.951   2833687        640679  0.888    Europe
Germany             80.940   3874437        357114  0.916    Europe
Italy               60.665   2167744        301336  0.873    Europe
Japan              127.061   4602367        377930  0.891      Asia
United Kingdom      64.511   2950039        242495  0.907    Europe
United States      318.523  17348075       9525067  0.915   America


#### By index value
```df.iloc[row, column]```

In [31]:
print(df.iloc[0])

Population       35.467
GDP             1785387
Surface Area    9984670
HDI               0.913
Continent       America
Name: Canada, dtype: object


In [32]:
print(df.iloc[[0, 2]])

         Population      GDP  Surface Area    HDI Continent
Canada       35.467  1785387       9984670  0.913   America
Germany      80.940  3874437        357114  0.916    Europe


In [34]:
print(df.iloc[2:5])

         Population      GDP  Surface Area    HDI Continent
Germany      80.940  3874437        357114  0.916    Europe
Italy        60.665  2167744        301336  0.873    Europe
Japan       127.061  4602367        377930  0.891      Asia


#### By index name
```df.loc[row, column]```

In [35]:
print(df.loc['Canada'])

Population       35.467
GDP             1785387
Surface Area    9984670
HDI               0.913
Continent       America
Name: Canada, dtype: object


In [36]:
print(df.loc[['Canada', 'Italy']])

        Population      GDP  Surface Area    HDI Continent
Canada      35.467  1785387       9984670  0.913   America
Italy       60.665  2167744        301336  0.873    Europe


In [38]:
print(df.loc['Canada':'Italy'])

         Population      GDP  Surface Area    HDI Continent
Canada       35.467  1785387       9984670  0.913   America
France       63.951  2833687        640679  0.888    Europe
Germany      80.940  3874437        357114  0.916    Europe
Italy        60.665  2167744        301336  0.873    Europe


### Accessing Columns

In [39]:
df["GDP"]

Canada             1785387
France             2833687
Germany            3874437
Italy              2167744
Japan              4602367
United Kingdom     2950039
United States     17348075
Name: GDP, dtype: int64

In [42]:
df[["GDP", "HDI"]]

                     GDP    HDI
Canada           1785387  0.913
France           2833687  0.888
Germany          3874437  0.916
Italy            2167744  0.873
Japan            4602367  0.891
United Kingdom   2950039  0.907
United States   17348075  0.915


### Accessing both rows and columns

In [44]:
df.iloc[1:4, 2:4]

         Surface Area    HDI
France         640679  0.888
Germany        357114  0.916
Italy          301336  0.873


In [43]:
df.loc["Canada":"Italy", ["GDP", "HDI"]]

             GDP    HDI
Canada   1785387  0.913
France   2833687  0.888
Germany  3874437  0.916
Italy    2167744  0.873


### Conditional Selection

In [46]:
df.loc[df["Population"] > 70]

               Population       GDP  Surface Area    HDI Continent
Germany            80.940   3874437        357114  0.916    Europe
Japan             127.061   4602367        377930  0.891      Asia
United States     318.523  17348075       9525067  0.915   America


In [47]:
df.loc[df["Population"] > 70, ["GDP", "HDI"]]

                    GDP    HDI
Germany         3874437  0.916
Japan           4602367  0.891
United States  17348075  0.915


### Properties
* The **head()** method returns the column headers and a specified number of rows, starting from the top.
* If the number of rows is not specified, head() returns the first 5 rows.

In [2]:
import pandas as pd
df = pd.read_csv('data.csv')

print(df.head(10))

   Duration  Pulse  Maxpulse  Calories
0        60    110       130     409.1
1        60    117       145     479.0
2        60    103       135     340.0
3        45    109       175     282.4
4        45    117       148     406.0
5        60    102       127     300.0
6        60    110       136     374.0
7        45    104       134     253.3
8        30    109       133     195.1
9        60     98       124     269.0


* The **tail()** method returns the headers and a specified number of rows, starting from the bottom.

In [3]:
print(df.tail())

     Duration  Pulse  Maxpulse  Calories
164        60    105       140     290.8
165        60    110       145     300.0
166        60    115       145     310.2
167        75    120       150     320.4
168        75    125       150     330.4


In [4]:
# A DataFrame has a method called info() that summarizes the data set.
# info() prints the summary itself and returns None, which is why print() adds a trailing "None".
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 169 entries, 0 to 168
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Duration  169 non-null    int64  
 1   Pulse     169 non-null    int64  
 2   Maxpulse  169 non-null    int64  
 3   Calories  164 non-null    float64
dtypes: float64(1), int64(3)
memory usage: 5.4 KB
None


In [8]:
# columns
df.columns

Index(['Duration', 'Pulse', 'Maxpulse', 'Calories'], dtype='object')

In [9]:
# index
df.index

RangeIndex(start=0, stop=169, step=1)

In [10]:
df.size

676

In [11]:
df.shape

(169, 4)

In [14]:
df.describe()

         Duration       Pulse    Maxpulse     Calories
count  169.000000  169.000000  169.000000   164.000000
mean    63.846154  107.461538  134.047337   375.790244
std     42.299949   14.510259   16.450434   266.379919
min     15.000000   80.000000  100.000000    50.300000
25%     45.000000  100.000000  124.000000   250.925000
50%     60.000000  105.000000  131.000000   318.600000
75%     60.000000  111.000000  141.000000   387.600000
max    300.000000  159.000000  184.000000  1860.400000


In [13]:
df.dtypes

Duration      int64
Pulse         int64
Maxpulse      int64
Calories    float64
dtype: object

### Modifying DataFrames
In most cases, modifying functions return a new DataFrame and leave the existing one unchanged. To change the existing DataFrame, assign the result back to the same variable.
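
A minimal sketch of that behaviour with `drop()`, using a throwaway two-column DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

df.drop(columns="b")          # returns a new DataFrame; df is unchanged
print(list(df.columns))       # ['a', 'b']

df = df.drop(columns="b")     # reassign to keep the result
print(list(df.columns))       # ['a']
```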

In [48]:
import pandas as pd
df = pd.DataFrame({
    'Population': [35.467, 63.951, 80.94 , 60.665, 127.061, 64.511, 318.523],
    'GDP': [
        1785387,
        2833687,
        3874437,
        2167744,
        4602367,
        2950039,
        17348075
    ],
    'Surface Area': [
        9984670,
        640679,
        357114,
        301336,
        377930,
        242495,
        9525067
    ],
    'HDI': [
        0.913,
        0.888,
        0.916,
        0.873,
        0.891,
        0.907,
        0.915
    ],
    'Continent': [
        'America',
        'Europe',
        'Europe',
        'Europe',
        'Asia',
        'Europe',
        'America'
    ]
}, columns=['Population', 'GDP', 'Surface Area', 'HDI', 'Continent'], 
   index = ['Canada', 'France', 'Germany', 'Italy', 'Japan', 'United Kingdom', 'United States'])
df

                Population       GDP  Surface Area    HDI Continent
Canada              35.467   1785387       9984670  0.913   America
France              63.951   2833687        640679  0.888    Europe
Germany             80.940   3874437        357114  0.916    Europe
Italy               60.665   2167744        301336  0.873    Europe
Japan              127.061   4602367        377930  0.891      Asia
United Kingdom      64.511   2950039        242495  0.907    Europe
United States      318.523  17348075       9525067  0.915   America


#### Dropping values

In [49]:
df.drop("Canada")

                Population       GDP  Surface Area    HDI Continent
France              63.951   2833687        640679  0.888    Europe
Germany             80.940   3874437        357114  0.916    Europe
Italy               60.665   2167744        301336  0.873    Europe
Japan              127.061   4602367        377930  0.891      Asia
United Kingdom      64.511   2950039        242495  0.907    Europe
United States      318.523  17348075       9525067  0.915   America


In [50]:
df.drop(["Canada", "Japan"])

                Population       GDP  Surface Area    HDI Continent
France              63.951   2833687        640679  0.888    Europe
Germany             80.940   3874437        357114  0.916    Europe
Italy               60.665   2167744        301336  0.873    Europe
United Kingdom      64.511   2950039        242495  0.907    Europe
United States      318.523  17348075       9525067  0.915   America


In [51]:
df.drop(columns=["GDP", "HDI"])

                Population  Surface Area Continent
Canada              35.467       9984670   America
France              63.951        640679    Europe
Germany             80.940        357114    Europe
Italy               60.665        301336    Europe
Japan              127.061        377930      Asia
United Kingdom      64.511        242495    Europe
United States      318.523       9525067   America


In [52]:
df.drop(["Italy", "Japan"], axis=0)

                Population       GDP  Surface Area    HDI Continent
Canada              35.467   1785387       9984670  0.913   America
France              63.951   2833687        640679  0.888    Europe
Germany             80.940   3874437        357114  0.916    Europe
United Kingdom      64.511   2950039        242495  0.907    Europe
United States      318.523  17348075       9525067  0.915   America


In [53]:
df.drop(["GDP", "HDI"], axis=1)

                Population  Surface Area Continent
Canada              35.467       9984670   America
France              63.951        640679    Europe
Germany             80.940        357114    Europe
Italy               60.665        301336    Europe
Japan              127.061        377930      Asia
United Kingdom      64.511        242495    Europe
United States      318.523       9525067   America


#### Adding a Column

In [54]:
langs = pd.Series(
    ['French', 'German', 'Italian'],
    index=['France', 'Germany', 'Italy'],
    name='Language'
)
langs

France      French
Germany     German
Italy      Italian
Name: Language, dtype: object

In [55]:
df['Language'] = langs
df

                Population       GDP  Surface Area    HDI Continent Language
Canada              35.467   1785387       9984670  0.913   America      NaN
France              63.951   2833687        640679  0.888    Europe   French
Germany             80.940   3874437        357114  0.916    Europe   German
Italy               60.665   2167744        301336  0.873    Europe  Italian
Japan              127.061   4602367        377930  0.891      Asia      NaN
United Kingdom      64.511   2950039        242495  0.907    Europe      NaN
United States      318.523  17348075       9525067  0.915   America      NaN


#### Replacing values in column

In [56]:
df["Language"] = "English"
df

                Population       GDP  Surface Area    HDI Continent Language
Canada              35.467   1785387       9984670  0.913   America  English
France              63.951   2833687        640679  0.888    Europe  English
Germany             80.940   3874437        357114  0.916    Europe  English
Italy               60.665   2167744        301336  0.873    Europe  English
Japan              127.061   4602367        377930  0.891      Asia  English
United Kingdom      64.511   2950039        242495  0.907    Europe  English
United States      318.523  17348075       9525067  0.915   America  English


#### Renaming column and row names

In [58]:
df.rename(
    columns={
        'HDI': 'Human Development Index',
        'Anual Popcorn Consumption': 'APC'
    }, index={
        'United States': 'USA',
        'United Kingdom': 'UK',
        'Argentina': 'AR'
    })

         Population       GDP  Surface Area  Human Development Index Continent Language
Canada       35.467   1785387       9984670                    0.913   America  English
France       63.951   2833687        640679                    0.888    Europe  English
Germany      80.940   3874437        357114                    0.916    Europe  English
Italy        60.665   2167744        301336                    0.873    Europe  English
Japan       127.061   4602367        377930                    0.891      Asia  English
UK           64.511   2950039        242495                    0.907    Europe  English
USA         318.523  17348075       9525067                    0.915   America  English


In [59]:
# Only the column assignments modified the original DataFrame; drop() and rename() returned new DataFrames.
df

                Population       GDP  Surface Area    HDI Continent Language
Canada              35.467   1785387       9984670  0.913   America  English
France              63.951   2833687        640679  0.888    Europe  English
Germany             80.940   3874437        357114  0.916    Europe  English
Italy               60.665   2167744        301336  0.873    Europe  English
Japan              127.061   4602367        377930  0.891      Asia  English
United Kingdom      64.511   2950039        242495  0.907    Europe  English
United States      318.523  17348075       9525067  0.915   America  English


# Panel
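Panel was a 3-D data structure in older versions of pandas. It was deprecated and then removed in pandas 0.25, so it is not available in the version used here (1.4.2). A DataFrame with a MultiIndex is the usual substitute for three-dimensional data.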