# Data Manipulation and Data Cleaning with Pandas

In [None]:
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Bastian', 'Ella', 'Jaco'],
    'age': [7, 83, 34, 12, 79, 35],
    'gender': ['F', 'M', 'M', 'M', 'F', 'M'],
    'city': ['Oxfordshire', 'Marshall', 'Kansas City', 'De Forest', 'Newport News', 'Norristown']
})

## Filtering Rows 
You can filter rows based on specific conditions using boolean indexing. 

In [None]:
df_filtered = df[df['age'] > 30]
print(df_filtered)
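
Boolean indexing also extends to combined conditions. The following is a minimal sketch, assuming a small hypothetical `people` DataFrame with the same columns as above; it shows `&` for combining conditions, `isin()` for membership tests, and `query()` as a string-based alternative.

```python
import pandas as pd

# Hypothetical sample data, mirroring the columns used above
people = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Bastian'],
    'age': [7, 83, 34, 12],
    'gender': ['F', 'M', 'M', 'M'],
})

# Combine conditions with & (and) or | (or); wrap each condition in parentheses
adult_men = people[(people['age'] > 30) & (people['gender'] == 'M')]

# isin() keeps rows whose value appears in a list
selected = people[people['name'].isin(['Alice', 'Bastian'])]

# query() expresses the same kind of filter as a string
under_30 = people.query('age < 30')

print(adult_men)
print(selected)
print(under_30)
```

The parentheses around each condition matter because `&` and `|` bind more tightly than comparison operators in Python.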

## Dropping Rows and Columns
You can drop rows and columns from a DataFrame using the `drop()` method. Pass `axis=1` to drop a column by name; the default, `axis=0`, drops rows by index label.

In [None]:
df_dropped = df.drop('city', axis=1)
print(df_dropped)

To drop rows based on specific conditions, you can use boolean indexing with negation.

In [None]:
df_dropped = df[~(df['age'] <= 30)]
print(df_dropped)
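
As a further sketch, `drop()` also accepts `columns=` and `index=` keywords, which some readers find clearer than `axis`. The `people` DataFrame below is hypothetical sample data.

```python
import pandas as pd

# Hypothetical sample data
people = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [7, 83, 34],
    'city': ['Oxfordshire', 'Marshall', 'Kansas City'],
})

# Drop several columns at once with the columns= keyword (equivalent to axis=1)
names_only = people.drop(columns=['age', 'city'])

# Drop rows by index label with the index= keyword
without_first_two = people.drop(index=[0, 1])

# errors='ignore' skips labels that do not exist instead of raising KeyError
unchanged = people.drop(columns=['salary'], errors='ignore')

print(names_only)
print(without_first_two)
```

None of these calls modifies `people`; each returns a new DataFrame.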

## Renaming Columns
You can rename columns in a DataFrame using the `rename()` method.

In [None]:
df_renamed = df.rename(columns={'age': 'years'})
print(df_renamed)
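
`rename()` is not limited to a single column. A short sketch, using a hypothetical `people` DataFrame, of renaming several columns, applying a function to every label, and renaming index labels:

```python
import pandas as pd

# Hypothetical sample data
people = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [7, 83]})

# Rename several columns at once with a dictionary
renamed = people.rename(columns={'name': 'first_name', 'age': 'years'})

# Pass a function to apply the same change to every column label
upper = people.rename(columns=str.upper)

# Index labels can be renamed the same way
labelled = people.rename(index={0: 'first', 1: 'second'})

print(renamed)
print(upper)
print(labelled)
```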

## Handling Missing Data
You can detect missing data in a DataFrame with `isnull()`, fill it with `fillna()`, or remove it with `dropna()`.

In [None]:
# create a dataframe with missing values
new_record = pd.Series({"name":"Ash", "gender":"M", "city":"Pallet Town"})
df_filled = pd.concat([df, new_record.to_frame().T], ignore_index=True)

# assign the result back; calling fillna(inplace=True) on a selected column is deprecated
df_filled['age'] = df_filled['age'].fillna(df_filled['age'].mean())
df_filled['age'] = df_filled['age'].astype('int')
print(df_filled)

In [None]:
import numpy as np

# create a dataframe with missing values
df['salary'] = [np.nan, 200, 300, np.nan, 500, 600]
new_record = pd.Series({"name":"Misty", "gender":"F", "city":"Cerulean City"})

df = pd.concat([df, new_record.to_frame().T], ignore_index=True)

df


In [None]:
# check for missing values
df.isnull()

In [None]:
# fill missing values with a specific value
df["salary"] = df["salary"].fillna(0)
df

In [None]:
# drop rows with missing values
df.dropna(inplace=True)
df
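
The calls above treat every column the same way. As a sketch of finer control, assuming a hypothetical `staff` DataFrame, you can count missing values per column, fill each column with its own replacement, and drop rows based on selected columns only:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with gaps in two columns
staff = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Misty'],
    'age': [7, 83, np.nan, np.nan],
    'salary': [np.nan, 200.0, 300.0, np.nan],
})

# Count missing values per column
missing_counts = staff.isna().sum()

# Fill each column with its own replacement value by passing a dictionary
filled = staff.fillna({'age': staff['age'].median(), 'salary': 0})

# Drop only the rows where 'age' is missing, ignoring other columns
age_known = staff.dropna(subset=['age'])

print(missing_counts)
print(filled)
print(age_known)
```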

## Removing duplicates

In [None]:
# create a dataframe with duplicate rows
df = pd.concat([df, df.iloc[4].to_frame().T], ignore_index=True)

# check for duplicate rows
print(df.duplicated())

print(df)
# drop duplicate rows
df.drop_duplicates(inplace=True)

df
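
Duplicates are not always exact copies of a whole row. The sketch below, using a hypothetical `visits` DataFrame, shows the `subset` and `keep` parameters of `duplicated()` and `drop_duplicates()`:

```python
import pandas as pd

# Hypothetical sample data
visits = pd.DataFrame({
    'name': ['Ella', 'Jaco', 'Ella', 'Ella'],
    'city': ['Newport News', 'Norristown', 'Newport News', 'Marshall'],
})

# Full-row comparison: only the third row repeats an earlier row exactly
exact_dupes = visits.duplicated()

# subset= compares only the listed columns; keep='last' retains the final occurrence
last_per_name = visits.drop_duplicates(subset=['name'], keep='last')

# keep=False flags every member of a duplicated group
all_dupes = visits.duplicated(subset=['name'], keep=False)

print(exact_dupes)
print(last_per_name)
print(all_dupes)
```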

# Data Transformation  
Beyond basic operations, you might need to reshape or aggregate your data in more complex ways:

## Converting data types

In [None]:
# check data types
print(df.dtypes)

# convert data types
df['age'] = df['age'].astype(int)
df['salary'] = pd.to_numeric(df['salary'])

# check data types again
print(df.dtypes)

In [None]:
# Apply a function to a column
df['age'] = df['age'].apply(lambda x: x + 1)
print(df)
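
Real data often contains values that cannot be converted. As a sketch under that assumption, with a hypothetical `raw` DataFrame of strings, `pd.to_numeric(errors='coerce')` turns unparseable values into `NaN`, and the nullable `Int64` dtype keeps integers alongside missing values:

```python
import pandas as pd

# Hypothetical raw data read in as text
raw = pd.DataFrame({
    'age': ['7', '83', 'unknown'],
    'salary': ['100.5', '200', None],
})

# errors='coerce' turns values that cannot be parsed into NaN instead of raising
raw['age'] = pd.to_numeric(raw['age'], errors='coerce')
raw['salary'] = pd.to_numeric(raw['salary'], errors='coerce')

# The nullable 'Int64' dtype stores integers alongside missing values,
# whereas astype(int) would raise on NaN
raw['age'] = raw['age'].astype('Int64')

print(raw)
print(raw.dtypes)
```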

## Grouping data
The `groupby()` method groups the rows of a DataFrame by the values in one or more columns, much like the SQL `GROUP BY` statement.

Calling `groupby()` returns a `DataFrameGroupBy` object. You can then apply aggregation operations, such as `sum`, `mean`, `min`, and `max`, to each group.

In [None]:
# select 'age' before aggregating so the text columns are not averaged
df_grouped = df.groupby('gender')['age'].mean()
print(df_grouped)

In [None]:
# create a sample DataFrame
df = pd.DataFrame({
    'Category': ['A', 'B', 'C', 'A', 'B', 'C'],
    'Value': [1, 2, 3, 4, 5, 6]
})

# group the DataFrame by the 'Category' column
grouped_df = df.groupby('Category')

# calculate the mean value for each group
mean_values = grouped_df.mean()

# display the mean values for each group
print(mean_values)

In this example, we created a DataFrame with two columns, 'Category' and 'Value', and then grouped the data by the 'Category' column using the `groupby` method. We then calculated the mean value for each group using the `mean` method, which returns a new DataFrame with the mean values for each group.
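
Beyond a single `mean()`, a grouped object can compute several statistics at once. The following sketch reuses the same 'Category' and 'Value' data under the hypothetical name `sales` and shows `agg()` with a list of functions, named aggregation, and `as_index=False`:

```python
import pandas as pd

# Same shape of data as above, under a hypothetical name
sales = pd.DataFrame({
    'Category': ['A', 'B', 'C', 'A', 'B', 'C'],
    'Value': [1, 2, 3, 4, 5, 6],
})

# Several aggregations of one column in a single call
summary = sales.groupby('Category')['Value'].agg(['sum', 'mean', 'max'])

# Named aggregation: keyword=(column, function) gives readable output columns
named = sales.groupby('Category').agg(
    total=('Value', 'sum'),
    largest=('Value', 'max'),
)

# as_index=False keeps 'Category' as a regular column instead of the index
flat = sales.groupby('Category', as_index=False)['Value'].sum()

print(summary)
print(named)
print(flat)
```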

## Merging and Joining data
Merging and joining both combine two data frames into one. The key difference is what they match on.

Merging combines two data frames based on the values in one or more columns. If both data frames share a column name, you can merge on it with `on`; if the key columns have different names, you can specify them with `left_on` and `right_on`.

Joining, on the other hand, combines two data frames based on their indices by default.

The four most common join types in Pandas are:

* **Inner join**: returns only the rows whose keys match in both data frames.
* **Left join**: returns all the rows from the left data frame and the matching rows from the right data frame; where there is no match, the right-hand columns are filled with `NaN`.
* **Right join**: returns all the rows from the right data frame and the matching rows from the left data frame; where there is no match, the left-hand columns are filled with `NaN`.
* **Outer join**: returns all the rows from both data frames, filling in `NaN` where there is no match.

Pandas provides `merge()`, `join()`, and `concat()` for combining data frames. `merge()` and `join()` take a `how` parameter for the join type and parameters such as `on` for the columns or indices to match on; `concat()` stacks data frames along an axis.
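
As a minimal sketch of merging on differently named key columns, assume two hypothetical frames, `orders` and `customers`. The `indicator=True` option adds a `_merge` column that records whether each row came from the left frame, the right frame, or both:

```python
import pandas as pd

# Hypothetical frames whose key columns have different names
orders = pd.DataFrame({'customer_id': [1, 2, 3], 'amount': [50, 20, 70]})
customers = pd.DataFrame({'id': [2, 3, 4], 'name': ['Bob', 'Charlie', 'Ella']})

# left_on / right_on match columns whose names differ between the frames;
# indicator=True adds a '_merge' column showing where each row came from
merged = pd.merge(
    orders, customers,
    left_on='customer_id', right_on='id',
    how='outer', indicator=True,
)

print(merged)
```

Inspecting `_merge` is a quick way to see which keys failed to match before choosing a join type.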

In [None]:
# create first dataframe
df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'],
                    'value': [1, 2, 3, 4]}, 
                   index=[0, 1, 2, 3])

# create second dataframe
df2 = pd.DataFrame({'key': ['C', 'D', 'E', 'F'],
                    'value': [5, 6, 7, 8]}, 
                   index=[4, 5, 6, 7])


In [None]:
# perform inner join on 'key' column
inner_join_columns = pd.merge(df1, df2, on='key', how='inner')

print(inner_join_columns)

In [None]:
# perform inner join on the index (the result is empty because df1 and df2 share no index labels)
inner_join_rows = df1.join(df2, lsuffix='_left', rsuffix='_right', how='inner')

print(inner_join_rows)

In [None]:
# perform left join on 'key' column
left_join_columns = pd.merge(df1, df2, on='key', how='left')

print(left_join_columns)

In [None]:
# perform left join on the index
left_join_rows = df1.join(df2, lsuffix='_left', rsuffix='_right', how='left')

print(left_join_rows)

In [None]:
# perform right join on 'key' column
right_join_columns = pd.merge(df1, df2, on='key', how='right')

print(right_join_columns)

In [None]:
# perform right join on the index
right_join_rows = df1.join(df2, lsuffix='_left', rsuffix='_right', how='right')

print(right_join_rows)

In [None]:
# perform outer join on 'key' column
outer_join_columns = pd.merge(df1, df2, on='key', how='outer')

print(outer_join_columns)

In [None]:
# perform outer join on the index
outer_join_rows = df1.join(df2, lsuffix='_left', rsuffix='_right', how='outer')

print(outer_join_rows)
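
The examples above cover `merge()` and `join()`. For completeness, here is a short sketch of `concat()`, which does no key matching and simply stacks frames along an axis. The frames `df_a` and `df_b` are hypothetical.

```python
import pandas as pd

# Hypothetical frames with the same columns
df_a = pd.DataFrame({'key': ['A', 'B'], 'value': [1, 2]})
df_b = pd.DataFrame({'key': ['C', 'D'], 'value': [3, 4]})

# Stack rows (axis=0, the default); ignore_index=True renumbers the result from 0
stacked_rows = pd.concat([df_a, df_b], ignore_index=True)

# Place frames side by side (axis=1); rows are aligned on the index
side_by_side = pd.concat([df_a, df_b], axis=1)

print(stacked_rows)
print(side_by_side)
```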

# Pivoting and Reshaping data

Pivoting and reshaping data refer to the process of transforming data from its original form to a more structured and organized form that is easier to analyze.

**Pivoting** refers to the process of reorganizing a dataframe so that the values of one column become the column headers, while the values in another column become the row indices. This can be useful for creating summary tables and reports.

**Reshaping**, on the other hand, refers to the process of changing the structure of a dataframe by rearranging its rows and columns. This can be useful for aggregating data and performing more complex data analysis.

Pandas provides several functions for pivoting and reshaping data, including `pivot()`, `melt()`, `stack()`, `unstack()`, `pivot_table()`, and `transpose()`. These functions allow you to manipulate dataframes in various ways to suit your analysis needs.

**Note:**  
Data can be stored in a **"wide"** format or a **"long"** format. In a **wide format**, a DataFrame has a separate column for each variable, while in a **long format**, the values of those variables are stacked in a single column and identified by a second column of labels.

### Pivot
You can use the `pivot()` function to transform a long-form dataframe into a wide-form dataframe by specifying the columns to use as row and column labels and the values to use for populating the cells:

In [None]:
# create a long-form dataframe
df_long = pd.DataFrame({'Year': [2010, 2010, 2011, 2011],
                        'Quarter': ['Q1', 'Q2', 'Q1', 'Q2'],
                        'Sales': [100, 200, 150, 250]})

# pivot the dataframe to wide format
df_wide_pivot = df_long.pivot(index='Year', columns='Quarter', values='Sales')

In [None]:
df_long

In [None]:
df_wide_pivot

In this example, the `pivot()` function reshapes the `df_long` dataframe into the `df_wide_pivot` dataframe, with the years as row labels, quarters as column labels, and sales values as cell values.
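
One caveat is that `pivot()` requires each index/column pair to be unique. The sketch below, using hypothetical `sales` data in which (2010, 'Q1') appears twice, shows the resulting `ValueError` and how `pivot_table()` resolves the duplicates by aggregating them:

```python
import pandas as pd

# Hypothetical data in which (2010, 'Q1') appears twice
sales = pd.DataFrame({
    'Year': [2010, 2010, 2010],
    'Quarter': ['Q1', 'Q1', 'Q2'],
    'Sales': [100, 50, 200],
})

try:
    sales.pivot(index='Year', columns='Quarter', values='Sales')
    pivot_failed = False
except ValueError as exc:
    # pivot() cannot decide which of the duplicate values to keep
    pivot_failed = True
    print(f'pivot() failed: {exc}')

# pivot_table() resolves duplicates by aggregating them
totals = sales.pivot_table(index='Year', columns='Quarter', values='Sales', aggfunc='sum')
print(totals)
```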

### Melt
Similarly, you can use the `melt()` function to transform a wide-form dataframe into a long-form dataframe by specifying the columns to use as id variables and the columns to use as value variables:

In [None]:
df_wide_1 = df_wide_pivot.reset_index()

df_wide_1

In [None]:
# melt the dataframe to long format
df_long_melt = pd.melt(df_wide_1, id_vars=['Year'], var_name='Quarter', value_name='Sales')

df_long_melt

In this example, the `melt()` function reshapes the `df_wide_1` dataframe into the long-form `df_long_melt` dataframe, with `Year` as the id variable, the former column labels collected in a `Quarter` column, and the cell values in a `Sales` column.
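
When only some columns should be unpivoted, `melt()` accepts a `value_vars` argument. A small sketch, assuming a hypothetical `wide` DataFrame with an extra `Q3` column:

```python
import pandas as pd

# Hypothetical wide-form data with three quarter columns
wide = pd.DataFrame({
    'Year': [2010, 2011],
    'Q1': [100, 150],
    'Q2': [200, 250],
    'Q3': [120, 170],
})

# value_vars limits which columns are unpivoted; columns listed in neither
# id_vars nor value_vars (here 'Q3') are dropped from the result
first_half = pd.melt(
    wide,
    id_vars=['Year'],
    value_vars=['Q1', 'Q2'],
    var_name='Quarter',
    value_name='Sales',
)

print(first_half)
```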

### Stacking and Unstacking
The `stack()` operation pivots a level of the column labels into the row index, creating a multi-level index. The `unstack()` operation does the opposite, pivoting a level of the row index into column labels.

For example, if we have a DataFrame with two columns, A and B, and two index levels, `level_1` and `level_2`, calling `stack()` returns a Series with three index levels (`level_1`, `level_2`, and a new level created from the column labels) that holds the stacked data.

In [None]:
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, 
                  index=pd.MultiIndex.from_tuples([(1, 'a'), (2, 'b'), (3, 'c')], 
                    names=['level_1', 'level_2']))
df

In [None]:
stacked = df.stack()
stacked.to_frame()

In [None]:
unstacked = stacked.unstack()
unstacked
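
By default `unstack()` moves the innermost index level, but a `level` argument selects a different one. A brief sketch with a hypothetical `sales` Series indexed by year and quarter:

```python
import pandas as pd

# Hypothetical Series with a two-level index
index = pd.MultiIndex.from_product(
    [[2010, 2011], ['Q1', 'Q2']], names=['Year', 'Quarter']
)
sales = pd.Series([100, 200, 150, 250], index=index, name='Sales')

# unstack() moves the innermost index level ('Quarter') to the columns by default
by_quarter = sales.unstack()

# Pass a level name to choose which index level becomes the columns
by_year = sales.unstack(level='Year')

# stack() reverses the operation and returns a Series with a two-level index
round_trip = by_quarter.stack()

print(by_quarter)
print(by_year)
print(round_trip)
```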

### Pivot Table  
The `pivot_table()` function summarizes and aggregates data in a DataFrame. Its arguments include `values`, which specifies the column to aggregate; `index`, which specifies the column(s) or index level(s) to group by as rows; `columns`, which specifies the column(s) or index level(s) whose values become column labels; and `aggfunc`, which specifies how to aggregate (the default is the mean). It returns a new DataFrame containing the aggregated values.

In [None]:
df = pd.DataFrame({'A': [1, 2, 3, 1, 2, 3], 'B': [4, 5, 6, 7, 8, 9], 'C': ['x', 'y', 'z', 'x', 'y', 'z']})
df

In [None]:
pivot = df.pivot_table(values='B', index='A', columns='C', aggfunc='sum')
pivot
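
The result above contains `NaN` wherever a combination of `A` and `C` has no rows. As a sketch of two further options, using the same data under the hypothetical name `records`, `fill_value` replaces those empty cells and `margins=True` adds totals:

```python
import pandas as pd

# Same data as above, under a hypothetical name
records = pd.DataFrame({
    'A': [1, 2, 3, 1, 2, 3],
    'B': [4, 5, 6, 7, 8, 9],
    'C': ['x', 'y', 'z', 'x', 'y', 'z'],
})

# fill_value replaces cells with no matching rows; margins=True adds
# an 'All' row and column holding the totals
summary = records.pivot_table(
    values='B', index='A', columns='C',
    aggfunc='sum', fill_value=0, margins=True,
)

print(summary)
```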

### Transposing
Transposing a DataFrame means interchanging rows and columns. This operation is useful to change the shape of the DataFrame to better suit a particular analysis or visualization.

In [None]:
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

df

In [None]:
transposed = df.transpose()

transposed
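
One thing to watch when transposing is column dtypes. In the sketch below, which assumes a hypothetical `mixed` DataFrame with a text column and a numeric column, each transposed column holds both strings and integers and is therefore stored as `object`:

```python
import pandas as pd

# Hypothetical data with one text column and one numeric column
mixed = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [7, 83]})

# .T is shorthand for transpose()
flipped = mixed.T

# Column labels become the index and the old index becomes the column labels
print(flipped)

# Each new column mixes strings and integers, so it is stored as object dtype
print(flipped.dtypes)
```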

> Content created by [**Carlos Cruz-Maldonado**](https://www.linkedin.com/in/carloscruzmaldonado/).  
> I am available to answer any questions or provide further assistance.   
> Feel free to reach out to me at any time.  