# 1. Dataset Description

In a marketing **`sales_data`** dataset, the **`Region`** variable is coded with the values `N`, `S`, `E`, and `W`.
We want to recode these as **`N: North, S: South, E: East, and W: West`**
so that the column holds the full direction names.

To do this, we create a mapping from the single-letter values to the full names using a Python dictionary. The **keys** are the original single letters, and the associated **values** are the direction names. Let's call this mapping `region_mapping`.

Then, to recode the values, we use the pandas `.assign()` method.

Inside the call to `.assign()`, we use the `.replace()` method to replace the old single-letter category values with the new direction names.

In the output, we can see that the values of the `Region` variable have been recoded from the old single letters (e.g., **`'N'`**) to the full direction names (e.g., **`'North'`**).


# 2. Import libraries

In [1]:
import pandas as pd
import numpy as np

# 3. Dataset EDA (Exploratory Data Analysis)

In [2]:
# define raw dataset
raw_data = {
  'Name': ['John M', 'Anna', 'Peter', 'Linda', 'Ethan', 'Thomas', 'Edward', 'Amit'],
  'Sales': [1500000, 1200000, 1600000, 1300000, 2500000, 2900000, 3500000, 5000000],
  'Region': ['N', 'S', 'E', 'W', 'N', 'S', 'E', 'W']
}

# 4. Data Pre-processing

In [3]:
# create a pandas DataFrame from the raw sales data
sales_df = pd.DataFrame(raw_data)
print(sales_df)

     Name    Sales Region
0  John M  1500000      N
1    Anna  1200000      S
2   Peter  1600000      E
3   Linda  1300000      W
4   Ethan  2500000      N
5  Thomas  2900000      S
6  Edward  3500000      E
7    Amit  5000000      W


# 5. Data Cleaning

In [4]:
# create categorical mapping
region_mapping = {
    'N': 'North',
    'S': 'South',
    'E': 'East',
    'W': 'West'
}
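
Before recoding, it can help to check that the mapping covers every code that appears in the column, because `.replace()` leaves values without a matching key unchanged. The following is a minimal sketch under stated assumptions: the `regions` series and its extra `'X'` code are hypothetical and are not part of the sales data above.

```python
import pandas as pd

# Same mapping as above
region_mapping = {'N': 'North', 'S': 'South', 'E': 'East', 'W': 'West'}

# Hypothetical sample column with one code ('X') the mapping does not cover
regions = pd.Series(['N', 'S', 'E', 'W', 'X'], name='Region')

# Codes present in the data but missing from the mapping keys
unmapped = set(regions.unique()) - set(region_mapping)
print(unmapped)

# .replace() keeps any value that has no matching key as-is
recoded = regions.replace(region_mapping)
print(recoded.tolist())
```

An empty `unmapped` set suggests that every code in the column will be recoded.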

# 6. Data Evaluation

In [5]:
# recode the categorical variable
# .assign() returns a new DataFrame, so the result is assigned back to sales_df
sales_df = sales_df.assign(Region=sales_df['Region'].replace(region_mapping))
print(sales_df)

     Name    Sales Region
0  John M  1500000  North
1    Anna  1200000  South
2   Peter  1600000   East
3   Linda  1300000   West
4   Ethan  2500000  North
5  Thomas  2900000  South
6  Edward  3500000   East
7    Amit  5000000   West
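
One detail worth keeping in mind is that `.assign()` returns a new DataFrame and does not modify the original, so the recoded values only persist if the result is captured. The sketch below illustrates this on a two-row subset of the data; the name `recoded_df` is a hypothetical one chosen for illustration.

```python
import pandas as pd

# Two-row subset of the sales data, for illustration only
sales_df = pd.DataFrame({'Name': ['John M', 'Anna'], 'Region': ['N', 'S']})
region_mapping = {'N': 'North', 'S': 'South', 'E': 'East', 'W': 'West'}

# .assign() builds a new DataFrame; sales_df itself is left untouched
recoded_df = sales_df.assign(Region=sales_df['Region'].replace(region_mapping))

print(sales_df['Region'].tolist())    # original single-letter codes
print(recoded_df['Region'].tolist())  # full direction names
```

Assigning the result back to the same name (`sales_df = sales_df.assign(...)`) is a common way to keep only the recoded version.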


# Example of Creating a DataFrame from Arrays of Unequal Length

In [6]:
import pandas as pd

# Incorrect case: Arrays of different lengths
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30],  # This array is shorter
    'City': ['New York', 'Los Angeles', 'Chicago']
}

# This will raise a ValueError
df = pd.DataFrame(data)
print(df)

ValueError: All arrays must be of the same length

## `ValueError: All arrays must be of the same length`

## Handling the Error
If you encounter this error, you need to ensure that all your data arrays have the same length. You can handle this by either:

1. Truncating longer arrays to match the length of the shortest array.
2. Padding shorter arrays with a placeholder value (like **None or NaN**) to match the length of the longest array.
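
The first option, truncation, could look roughly like the sketch below. It reuses the lists from the failing example; `min_length` and `df_truncated` are hypothetical names. Note that truncating discards data (here, the third name and city), so it is only appropriate when the extra entries are not needed.

```python
import pandas as pd

# Arrays with different lengths
names = ['Alice', 'Bob', 'Charlie']
ages = [25, 30]  # Shorter array
cities = ['New York', 'Los Angeles', 'Chicago']

# Determine the length of the shortest array
min_length = min(len(names), len(ages), len(cities))

# Slice every array down to that length before building the DataFrame
df_truncated = pd.DataFrame({
    'Name': names[:min_length],
    'Age': ages[:min_length],
    'City': cities[:min_length],
})
print(df_truncated)
```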

## Example of Padding Arrays to Match Lengths


In [7]:
import pandas as pd
import numpy as np

# Arrays with different lengths
names = ['Alice', 'Bob', 'Charlie']
ages = [25, 30]  # Shorter array
cities = ['New York', 'Los Angeles', 'Chicago']

# Determine the maximum length
max_length = max(len(names), len(ages), len(cities))

# Pad the shorter arrays with None or np.nan
names += [None] * (max_length - len(names))
ages += [None] * (max_length - len(ages))
cities += [None] * (max_length - len(cities))

# Create the DataFrame
data = {
    'Name': names,
    'Age': ages,
    'City': cities
}

df = pd.DataFrame(data)
print(df)

      Name   Age         City
0    Alice  25.0     New York
1      Bob  30.0  Los Angeles
2  Charlie   NaN      Chicago
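
As an alternative to padding by hand, each list can be wrapped in a `pd.Series` before building the DataFrame. pandas aligns the Series on their index and fills the missing positions with `NaN`. This is a sketch of that approach, not the method used above; `df_padded` is a hypothetical name.

```python
import pandas as pd

# Arrays with different lengths
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30],  # Shorter array
    'City': ['New York', 'Los Angeles', 'Chicago'],
}

# Wrapping each list in a Series lets pandas align on the index
# and fill the gaps with NaN automatically
df_padded = pd.DataFrame({col: pd.Series(values) for col, values in data.items()})
print(df_padded)
```

As in the manual version, the `Age` column becomes a float column because `NaN` is a floating-point value.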
