# Data processing: Pandas

* Python library for data manipulation and analysis
* High-performance and efficient
* Easy to use
* Built to handle structured (tabular) data
* Supports operations such as filtering, grouping, and aggregation
* Supports data visualization

Often used in fields such as:
* data science
* machine learning
* data analysis




---------------------------------------------------------

First, **install** pandas using:
    pip install pandas

**Import pandas**

In [38]:
import pandas as pd  # pd is the conventional alias
from datetime import datetime

## Pandas Series

* A one-dimensional array-like object
* Can hold data of any type (integer, float, string, etc.)
* Labelled data, each element has an index. 
* Can be created from lists, arrays, dictionaries, and existing Series objects
* Building block for the Pandas DataFrame
* Like a column in a spreadsheet, a single column of a database table
* The length of a Series cannot be changed (but values can be changed and columns can be inserted into DataFrames)


--> The vast majority of Pandas methods produce new objects, leaving the input data untouched. 
* Immutability is favored where sensible
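
The point about methods producing new objects can be checked with a minimal sketch. The Series name `original` and its values are made up for illustration:

```python
import pandas as pd

original = pd.Series([3, 1, 2])

# sort_values() returns a new Series; `original` keeps its order.
sorted_copy = original.sort_values()

print(original.tolist())
print(sorted_copy.tolist())
```

To keep the sorted result, assign it to a name, as done with `sorted_copy` here.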



### Creating a Pandas Series from a list

In [3]:
data = ["Mickey", "Minnie", "Pluto", "Donald Duck"]
series_from_list = pd.Series(data)
series_from_list 

0         Mickey
1         Minnie
2          Pluto
3    Donald Duck
dtype: object

If no index labels are specified, the elements are labeled with their integer positions (starting from 0).

### Creating a Pandas Series from a dictionary

In [4]:
data = {"a": 1, "b": 2, "c": 3, "d":4}
series_from_dict = pd.Series(data)
series_from_dict

a    1
b    2
c    3
d    4
dtype: int64

In [5]:
my_series = pd.Series({'London':10, 'Tripoli':100, 'Cairo':10})

* Index labels: 'London', 'Tripoli', 'Cairo'
* Values of the Series: 10, 100, 10

In [6]:
# Displaying the Series
my_series

London      10
Tripoli    100
Cairo       10
dtype: int64

### Creating a Series with a custom index

In [7]:
data = ["Pooh", "Winnie", "Kanga", "Roo"]
index = ["Participant1", "Participant2", "Participant3", "Participant4"]
series_custom_index = pd.Series(data, index=index)
print(series_custom_index)

Participant1      Pooh
Participant2    Winnie
Participant3     Kanga
Participant4       Roo
dtype: object


In [8]:
# Creating a Series from an existing Series, keeping only the labels "London" and "Cairo"
lc_series = pd.Series(my_series, index=["London", "Cairo"])
lc_series

London    10
Cairo     10
dtype: int64
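
A related case worth knowing: if the requested index contains a label that is not in the source Series, pandas fills it with NaN. A small sketch, where "Paris" is a made-up label that does not exist in `my_series`:

```python
import pandas as pd

my_series = pd.Series({"London": 10, "Tripoli": 100, "Cairo": 10})

# "Paris" is not a label in my_series, so its value becomes NaN.
# NaN is a float, so the dtype is upcast from int64 to float64.
with_missing = pd.Series(my_series, index=["London", "Paris"])
print(with_missing)
```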

### Accessing and manipulating data using the index label

In [9]:
# Accessing a specific element in the Series using the index label
my_series['Tripoli']

100

In [10]:
print(series_from_list[0])
print(series_from_dict["b"])

Mickey
2


In [16]:
# Filtering elements based on a condition
my_series[my_series > 10]

Tripoli    100
dtype: int64
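
Filtering works through a boolean mask. The sketch below rebuilds `my_series` so it runs on its own, prints the mask, and then combines two conditions; the names `mask` and `in_range` are illustrative:

```python
import pandas as pd

my_series = pd.Series({"London": 10, "Tripoli": 100, "Cairo": 10})

# The comparison itself yields a boolean Series (the mask).
mask = my_series > 10
print(mask)

# Conditions combine with & (and), | (or), ~ (not).
# Each condition needs its own parentheses.
in_range = my_series[(my_series >= 10) & (my_series < 100)]
print(in_range)
```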

### Attributes
A Series object has several attributes: index, values, dtype, shape, ndim, size, name...

In [105]:
#Let's play around with the index

brand = ['Yamaha', 'KTM', 'Honda', 'Kawasaki', 'Suzuki']
models = ['R1', 'Superduke 1290', 'CBR600RR', 'ZX6R', 'GXSR 750']
series_moto = pd.Series(models, index=brand)
print(series_moto)
print("\nchanging the index of our series:")
series_moto.index = ['YAM', 'KTM', 'HON', 'KAW', 'SUZ']
series_moto.index

Yamaha                  R1
KTM         Superduke 1290
Honda             CBR600RR
Kawasaki              ZX6R
Suzuki            GXSR 750
dtype: object

changing the index of our series:


Index(['YAM', 'KTM', 'HON', 'KAW', 'SUZ'], dtype='object')

In [106]:
#Let's play around with the arrays
series_moto = pd.Series(models)
array_moto = pd.Series(['car', 'motorcycle', 636]).array
print(f'{array_moto}\n')

categorical = pd.Series(pd.Categorical([datetime.now()]))
print(f'{categorical.array}\n')

interval = pd.Series(pd.Interval(3, 4545))
print(f'{interval.array}\n')

period = pd.Series(pd.Period('2020', 'Y'))
print(period.array)
# pd.Period represents a time span such as a day, month, quarter, or year.
# The frequency argument (here 'Y') sets the length of that span.

# Some frequency aliases:

# 'D': daily
# 'M': monthly
# 'Y': yearly
# 'h': hourly ('H' in older pandas versions)
# 'min': minutely ('T' in older pandas versions)
# 's': secondly ('S' in older pandas versions)
<NumpyExtensionArray>
['car', 'motorcycle', 636]
Length: 3, dtype: object

[2024-05-28 20:17:31.427218]
Categories (1, datetime64[ns]): [2024-05-28 20:17:31.427218]

<IntervalArray>
[(3, 4545]]
Length: 1, dtype: interval[int64, right]

<PeriodArray>
['2020']
Length: 1, dtype: period[Y-DEC]
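
To make the idea of a time span more concrete, here is a minimal sketch with a monthly Period. The variable name `march` is just for illustration:

```python
import pandas as pd

march = pd.Period("2020-03", freq="M")  # the whole month of March 2020

print(march.start_time)  # first instant of the span
print(march.end_time)    # last instant of the span
print(march + 1)         # adding an integer shifts by whole periods
```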


In [107]:
# Let's play with values
series_moto.values

array(['R1', 'Superduke 1290', 'CBR600RR', 'ZX6R', 'GXSR 750'],
      dtype=object)

In [110]:
# Let's play with dtypes
integers = pd.Series([3, 5, 6])
integers.dtype

dtype('int64')

In [111]:
# Let's play with shape
series_moto.shape
# Return a tuple of the shape of the underlying data.

(5,)

In [112]:
# Let's play with nbytes
series_moto.nbytes
# Returns the number of bytes in the underlying data.

40

In [122]:
# Let's play with ndim
print(f'Dimensions for Series: {series_moto.ndim}')
# Number of dimensions. A Series is always one-dimensional.
# A DataFrame is 2 dimensions.
d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(f'Dimensions for a DataFrame: {df.ndim}')

Dimensions for Series: 1
Dimensions for a DataFrame: 2
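
The attributes `size` and `name` from the list above are not shown in the cells. A short sketch, using a made-up Series called `models`:

```python
import pandas as pd

models = pd.Series(["R1", "ZX6R", "CBR600RR"], name="model")

print(models.size)  # number of elements
print(models.name)  # label of the Series

# The name becomes the column label when converting to a DataFrame.
print(models.to_frame().columns.tolist())
```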


Check out the [API reference](https://pandas.pydata.org/docs/reference/series.html#constructor) for more functionality.

## Pandas DataFrame

* Two-dimensional labeled data structure
* Like a spreadsheet or table
* Three main components
    - the data: stored in rows and columns
    - the rows: labeled by an index
    - the columns: labeled and contain the actual data
* Flexible indexing
* Access to rows, columns, or individual elements based on labels or integer positions
* Size can be changed


### Creating a DataFrame from a dictionary

In [123]:
# Sample data (dict) for DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma'],
    'Age': [25, 30, 35, 28, 32],
    'City': ['London', 'New York', 'Paris', 'Tokyo', 'Sydney'],
    'Salary': [60000, 75000, 80000, 70000, 65000]
}

In [124]:
# Creating a DataFrame
df = pd.DataFrame(data)

# Displaying the DataFrame
df

      Name  Age      City  Salary
0    Alice   25    London   60000
1      Bob   30  New York   75000
2  Charlie   35     Paris   80000
3    David   28     Tokyo   70000
4     Emma   32    Sydney   65000


### Creating a DataFrame from a list of lists

In [125]:
data = [['John', 25, 'New York'],
        ['Alice', 30, 'Los Angeles'],
        ['Bob', 35, 'Chicago']]
columns = ['Name', 'Age', 'City']
df2 = pd.DataFrame(data, columns=columns)
df2

    Name  Age         City
0   John   25     New York
1  Alice   30  Los Angeles
2    Bob   35      Chicago


### Indexing

##### Accessing a column

In [126]:
# Accessing a column
print(df['Name'])

0      Alice
1        Bob
2    Charlie
3      David
4       Emma
Name: Name, dtype: object


##### Accessing a row by label

In [131]:
# Accessing a row by its index label (here the label is 0)
print(df.loc[0])

Name       Alice
Age           25
City      London
Salary     60000
Name: 0, dtype: object


##### Accessing a row by integer position

In [132]:
# Accessing a row by integer position
print(df.iloc[0])

Name       Alice
Age           25
City      London
Salary     60000
Name: 0, dtype: object


##### Accessing an individual element

In [134]:
# Accessing an individual element
print(df.at[3, 'Name'])
print(df.at[2, 'Salary'])

David
80000


### Slicing

##### Index labels: loc

In [135]:
# Using index labels for rows and columns: loc

df.loc[1:3, 'Name':'City']  

      Name  Age      City
1      Bob   30  New York
2  Charlie   35     Paris
3    David   28     Tokyo


##### Integer positions: iloc

In [136]:
# Using integer positions for rows and columns: iloc
df.iloc[1:3, 0:3]

      Name  Age      City
1      Bob   30  New York
2  Charlie   35     Paris


***Beware! Label slices (`loc`) include the end label; position slices (`iloc`) exclude the end position.***

In [145]:
print("Dataframe")
print(df)
print("\nLabels")
print(df.loc[1:3]) #labels
print("\n\nInteger positions")
print(df.iloc[1:3]) #integer positions

Dataframe
      Name  Age      City  Salary
0    Alice   25    London   60000
1      Bob   30  New York   75000
2  Charlie   35     Paris   80000
3    David   28     Tokyo   70000
4     Emma   32    Sydney   65000

Labels
      Name  Age      City  Salary
1      Bob   30  New York   75000
2  Charlie   35     Paris   80000
3    David   28     Tokyo   70000


Integer positions
      Name  Age      City  Salary
1      Bob   30  New York   75000
2  Charlie   35     Paris   80000
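
With the default integer index, labels and positions look alike, which hides the difference. A sketch with string labels makes it more visible; the DataFrame `people` and its index values are made up for illustration:

```python
import pandas as pd

people = pd.DataFrame(
    {"Age": [25, 30, 35, 28]},
    index=["alice", "bob", "charlie", "david"],  # hypothetical string labels
)

by_label = people.loc["bob":"david"]  # label slice: end label is included
by_position = people.iloc[1:3]        # position slice: end position is excluded

print(by_label)
print(by_position)
```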


### Column operations

##### Adding a new column

In [146]:
# Adding a new column
df['Experience'] = [3, 5, 7, 4, 6]
df

      Name  Age      City  Salary  Experience
0    Alice   25    London   60000           3
1      Bob   30  New York   75000           5
2  Charlie   35     Paris   80000           7
3    David   28     Tokyo   70000           4
4     Emma   32    Sydney   65000           6


##### Boolean indexing

In [147]:
# Filtering rows based on a condition (boolean indexing)
high_salary_employees = df[df['Salary'] > 70000]
print(high_salary_employees)

      Name  Age      City  Salary  Experience
1      Bob   30  New York   75000           5
2  Charlie   35     Paris   80000           7
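
Boolean indexing also works with several conditions at once. A minimal sketch on the same sample data, rebuilt here as `staff` so the block runs on its own:

```python
import pandas as pd

staff = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Emma"],
    "Age": [25, 30, 35, 28, 32],
    "City": ["London", "New York", "Paris", "Tokyo", "Sydney"],
    "Salary": [60000, 75000, 80000, 70000, 65000],
})

# Combine conditions with & / |; each condition needs parentheses.
young_and_well_paid = staff[(staff["Age"] < 31) & (staff["Salary"] >= 70000)]
print(young_and_well_paid)

# isin() keeps rows whose value appears in the given list.
in_europe = staff[staff["City"].isin(["London", "Paris"])]
print(in_europe)
```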


##### Sorting by column

In [150]:
# Sorting DataFrame by a column
sorted_df = df.sort_values(by='Age', ascending=False)
print(sorted_df)

      Name  Age      City  Salary  Experience
2  Charlie   35     Paris   80000           7
4     Emma   32    Sydney   65000           6
1      Bob   30  New York   75000           5
3    David   28     Tokyo   70000           4
0    Alice   25    London   60000           3
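
Sorting can also use more than one column. A small sketch with made-up data (`visitors`), sorting by city ascending and then by age descending within each city:

```python
import pandas as pd

visitors = pd.DataFrame({
    "City": ["Paris", "London", "Paris", "London"],
    "Age": [35, 25, 28, 40],
})

# One ascending flag per sort column.
ordered = visitors.sort_values(by=["City", "Age"], ascending=[True, False])
print(ordered)
```

The original row labels travel with the rows, so the index of `ordered` is no longer 0, 1, 2, 3.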


### Missing Data Handling

In [168]:
#Add a row with missing data
import numpy as np
new_row = {"Name": "Mo", "Age": 23, "City": "Leuven", "Salary": np.nan, "Experience": 1}
df.loc[len(df)] = new_row
df

      Name  Age      City   Salary  Experience
0    Alice   25    London  60000.0           3
1      Bob   30  New York  75000.0           5
2  Charlie   35     Paris  80000.0           7
3    David   28     Tokyo  70000.0           4
4     Emma   32    Sydney  65000.0           6
5       Mo   23    Leuven      NaN           1


In [169]:
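# Dropping rows with missing values
df.dropna()  # returns a new DataFrame; the result is discarded here
df  # unchanged: row 5 is still present
# To keep the result, assign it to a variable (see the next cell)
# or modify df in place with df.dropna(inplace=True)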

      Name  Age      City   Salary  Experience
0    Alice   25    London  60000.0           3
1      Bob   30  New York  75000.0           5
2  Charlie   35     Paris  80000.0           7
3    David   28     Tokyo  70000.0           4
4     Emma   32    Sydney  65000.0           6
5       Mo   23    Leuven      NaN           1


In [170]:
# Dropping rows with missing values
k = df.dropna()
k # Right way

      Name  Age      City   Salary  Experience
0    Alice   25    London  60000.0           3
1      Bob   30  New York  75000.0           5
2  Charlie   35     Paris  80000.0           7
3    David   28     Tokyo  70000.0           4
4     Emma   32    Sydney  65000.0           6


In [171]:
# Filling missing values with a specified value
p = df.fillna(0)
p

      Name  Age      City   Salary  Experience
0    Alice   25    London  60000.0           3
1      Bob   30  New York  75000.0           5
2  Charlie   35     Paris  80000.0           7
3    David   28     Tokyo  70000.0           4
4     Emma   32    Sydney  65000.0           6
5       Mo   23    Leuven      0.0           1
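
Before dropping or filling, it often helps to count the missing values, and a constant like 0 is not always the best replacement. A minimal sketch with a reduced, made-up DataFrame `staff`, filling the gap with the column mean instead:

```python
import numpy as np
import pandas as pd

staff = pd.DataFrame({
    "Name": ["Alice", "Bob", "Mo"],
    "Salary": [60000.0, 75000.0, np.nan],
})

# Number of missing values per column.
missing_counts = staff.isna().sum()
print(missing_counts)

# Fill only the Salary column, using its mean (NaN is skipped by mean()).
filled = staff.fillna({"Salary": staff["Salary"].mean()})
print(filled)
```

Whether the mean is a sensible replacement depends on the data; it is used here only to show the mechanism.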


### Grouping and aggregation

In [172]:
# Grouping by a column and calculating mean
avg_age_by_city = df.groupby('City')['Age'].mean()
avg_age_by_city

City
Leuven      23.0
London      25.0
New York    30.0
Paris       35.0
Sydney      32.0
Tokyo       28.0
Name: Age, dtype: float64
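
In the example above every city appears once, so each group has a single row. A sketch with made-up data (`visits`) where cities repeat, computing several aggregates at once with `agg`:

```python
import pandas as pd

visits = pd.DataFrame({
    "City": ["London", "London", "Paris", "Paris", "Paris"],
    "Age": [20, 30, 40, 50, 60],
})

# One row per city, one column per aggregate.
summary = visits.groupby("City")["Age"].agg(["count", "mean", "max"])
print(summary)
```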

### Attributes and underlying data
A DataFrame object has several attributes (index, columns, values, dtypes, axes, ndim, size, shape, empty) and methods such as head() and tail().

In [173]:
#Let's play around
df.shape
# This works the same way as shown for the Series.
# See the documentation for the other attributes.

(6, 5)

Use ***info()*** method to display information about the DataFrame

In [174]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6 entries, 0 to 5
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Name        6 non-null      object 
 1   Age         6 non-null      int64  
 2   City        6 non-null      object 
 3   Salary      5 non-null      float64
 4   Experience  6 non-null      int64  
dtypes: float64(1), int64(2), object(2)
memory usage: 288.0+ bytes


The ***describe()*** method generates summary statistics (count, mean, std, min, 25%, 50%, 75%, max) for numeric columns in the DataFrame

In [175]:
df.describe()

             Age       Salary  Experience
count        6.0          5.0         6.0
mean   28.833333      70000.0    4.333333
std     4.445972   7905.69415    2.160247
min         23.0      60000.0         1.0
25%        25.75      65000.0        3.25
50%         29.0      70000.0         4.5
75%         31.5      75000.0        5.75
max         35.0      80000.0         7.0


### Operations

In [176]:
# Sample data
data = {
    'A': [1, 2, 3, 4, 5],
    'B': [5, 4, 3, 2, 1],
    'C': [10, 20, 30, 40, 50]
}

df = pd.DataFrame(data)
df

   A  B   C
0  1  5  10
1  2  4  20
2  3  3  30
3  4  2  40
4  5  1  50


#### Arithmetic operations

In [177]:
# Addition
print(df['A'] + df['B'])  # Adds columns A and B element-wise

0    6
1    6
2    6
3    6
4    6
dtype: int64


In [178]:
# Subtraction
print(df['C'] - df['B'])  # Subtracts column B from column C element-wise

0     5
1    16
2    27
3    38
4    49
dtype: int64


In [179]:
df

   A  B   C
0  1  5  10
1  2  4  20
2  3  3  30
3  4  2  40
4  5  1  50


In [180]:
# Multiplication
print(df['A'] * df['C'])  # Multiplies columns A and C element-wise

0     10
1     40
2     90
3    160
4    250
dtype: int64


In [181]:
# Division
print(df['C'] / df['A'])  # Divides column C by column A element-wise

0    10.0
1    10.0
2    10.0
3    10.0
4    10.0
dtype: float64
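
Element-wise arithmetic matches values by index label, not by position. In the cells above both columns share the same index, so this is not visible. A sketch with two made-up Series (`left`, `right`) whose labels only partly overlap:

```python
import pandas as pd

left = pd.Series([1, 2, 3], index=["a", "b", "c"])
right = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Labels present in only one Series give NaN.
total = left + right
print(total)

# add() with fill_value treats the missing side as 0 instead.
total_filled = left.add(right, fill_value=0)
print(total_filled)
```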


#### Statistical aggregations

In [182]:
# Mean
print(df.mean())  # Calculates the mean of each column

A     3.0
B     3.0
C    30.0
dtype: float64


In [183]:
# Median
print(df.median())  # Calculates the median of each column

A     3.0
B     3.0
C    30.0
dtype: float64
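
By default these aggregations run per column. A short sketch, rebuilding the same sample data as `numbers`, that aggregates per row with `axis=1` and computes several statistics at once with `agg`:

```python
import pandas as pd

numbers = pd.DataFrame({
    "A": [1, 2, 3, 4, 5],
    "B": [5, 4, 3, 2, 1],
    "C": [10, 20, 30, 40, 50],
})

# axis=1 aggregates across the columns of each row.
row_means = numbers.mean(axis=1)
print(row_means)

# agg() with a list returns one row per statistic.
col_stats = numbers.agg(["min", "max", "sum"])
print(col_stats)
```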


#### Merging and joining DataFrames

In [184]:
# Concatenating along columns
df_concat = pd.concat([df, df], axis=1)  # Concatenates the DataFrame with itself along columns
df_concat

   A  B   C  A  B   C
0  1  5  10  1  5  10
1  2  4  20  2  4  20
2  3  3  30  3  3  30
3  4  2  40  4  2  40
4  5  1  50  5  1  50


In [185]:
# Concatenating along rows
df_concat = pd.concat([df, df], axis=0)  # Concatenates the DataFrame with itself along rows
df_concat

   A  B   C
0  1  5  10
1  2  4  20
2  3  3  30
3  4  2  40
4  5  1  50
0  1  5  10
1  2  4  20
2  3  3  30
3  4  2  40
4  5  1  50
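
Two follow-ups to the concatenation above, as a minimal sketch with made-up DataFrames (`left`, `right`). `ignore_index=True` avoids the repeated row labels, and `pd.merge` joins two DataFrames on a shared key column, which is what the heading calls merging:

```python
import pandas as pd

left = pd.DataFrame({"key": [1, 2, 3], "A": ["a1", "a2", "a3"]})
right = pd.DataFrame({"key": [2, 3, 4], "B": ["b2", "b3", "b4"]})

# Stack rows and renumber the index from 0.
stacked = pd.concat([left, left], axis=0, ignore_index=True)
print(stacked)

# Inner join: keep only keys present in both DataFrames.
merged = pd.merge(left, right, on="key", how="inner")
print(merged)
```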


### Creating a dataframe from data in a CSV

In [189]:
df = pd.read_csv('spotify-songs.csv', encoding="ISO-8859-1")
# The encoding argument depends on how the CSV file is encoded. Usually that is UTF-8 (the default).
# Some files use another encoding, such as ISO-8859-1 (Latin-1) here. The read_csv documentation is useful here!
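
Since `spotify-songs.csv` is not included here, the following sketch writes a tiny Latin-1 encoded file first and then reads it back. The file content and the name `songs` are made up for illustration:

```python
import os
import tempfile

import pandas as pd

# Write a small ISO-8859-1 encoded CSV to a temporary file (hypothetical data).
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", encoding="ISO-8859-1", delete=False, newline=""
) as handle:
    handle.write("artist,track\nBeyoncé,Halo\n")
    path = handle.name

try:
    # The encoding passed here must match the one used to write the file.
    songs = pd.read_csv(path, encoding="ISO-8859-1")
finally:
    os.remove(path)

print(songs)
```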