
![alt text](https://i.imgur.com/HRhd2Y0.png)

In Brazil, the postal code system is known as CEP, which stands for "Código de Endereçamento Postal" or "Postal Addressing Code" in English. The CEP is used by Correios, the Brazilian postal service, to sort and deliver mail efficiently. The system is somewhat analogous to the ZIP code system in the United States.

### Format of CEP
The Brazilian CEP is a string of eight numerical digits, usually formatted as `XXXXX-XXX`. 

1. **The first five digits** represent the area, district, or even a specific street in larger cities.
2. **The last three digits**, written after the hyphen, add granularity, often specifying a range of addresses within the area indicated by the first five digits.

For example, the CEP `01311-000` would refer to a specific region within the city of São Paulo.
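
The format described above can be checked and split with a few lines of Python. This is a minimal sketch: the helper names `split_cep` and `prefix_to_str` are hypothetical and not part of this notebook. The second helper is relevant because the `*_zip_code_prefix` columns used later in this notebook store the five-digit prefix as an integer, so leading zeros are lost (for example `1003`).

```python
import re

# Eight digits, with an optional hyphen after the fifth digit.
CEP_PATTERN = re.compile(r"^(\d{5})-?(\d{3})$")


def split_cep(cep: str) -> tuple[str, str]:
    """Return (prefix, suffix) for a CEP written as 'XXXXX-XXX' or 'XXXXXXXX'."""
    match = CEP_PATTERN.match(cep.strip())
    if match is None:
        raise ValueError(f"not a valid CEP: {cep!r}")
    return match.group(1), match.group(2)


def prefix_to_str(prefix: int) -> str:
    """Restore leading zeros lost when a five-digit prefix is stored as an integer."""
    return f"{prefix:05d}"


print(split_cep("01311-000"))  # ('01311', '000')
print(prefix_to_str(1003))     # 01003
```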

### Utility Beyond Postal Service
Besides its use in the postal system, the CEP is also commonly used in Brazil for various other purposes, like identifying addresses in online forms or in geographic information systems.

### Summary
1. CEP is Brazil's version of a ZIP code, used for efficient mail sorting and delivery.
2. It consists of eight digits, typically formatted as `XXXXX-XXX`.
3. The code is hierarchically structured, with the first five digits indicating a broader area and the last three providing more specific information.
4. The CEP is essential not only for mail delivery but also for other services that require address information.

# Import libraries

In [1]:
from functions import (
    find_csv_files,
    load_csvs_to_dict,
    sort_and_classify_column,
    get_country_bounding_box,
    set_outliers,
    impute_values,
)
from sklearn.impute import SimpleImputer
from unidecode import unidecode
import numpy as np
import pandas as pd
import plotly.express as px
from ydata_profiling import ProfileReport

# Step 1

In [2]:
# Path to the folder containing data files
DATA_PATH = "/Users/typhaine/Documents/Doc_Gorilla/OpenClassroom--Machine-Learning-Engineer/P4/data"

# Country of interest for the analysis
COUNTRY = "Brazil"

# Find all CSV files in the data directory
csv_files = find_csv_files(DATA_PATH)

# Load CSV files into a dictionary of DataFrames
dfs = load_csvs_to_dict(csv_files)

# Get the geographical bounding box coordinates for the specified country
country_bounds = get_country_bounding_box(COUNTRY)

# Select the DataFrame corresponding to the geolocation dataset for analysis
df = dfs["olist_geolocation_dataset"]

# Remove duplicate entries based on the 'geolocation_city' column
unique_cities_df = df.drop_duplicates(subset="geolocation_city")

# Create a scatter geo plot using the unique cities data
# Each point is placed on the map from its latitude and longitude
# Hovering over a point will display the name of the city
fig = px.scatter_geo(unique_cities_df, 
                     lat='geolocation_lat', 
                     lon='geolocation_lng', 
                     hover_name='geolocation_city')

# Display the scatter geo plot
fig.show()

# Step 2

In [3]:
# Remove accents from city names for standardization
df["geolocation_city"] = df["geolocation_city"].apply(unidecode)

# Mark or handle outliers in the DataFrame based on the geographical bounds of the country
df = set_outliers(df, country_bounds)

# Remove duplicate entries based on both the 'geolocation_city' and 'Outliers' columns
unique_cities_outliers_df = df.drop_duplicates(subset=["geolocation_city", "Outliers"])

# Create a scatter geo plot using the filtered data
# Each point is placed on the map from its latitude and longitude
# Hovering over a point will display the name of the city
# Color scheme is based on the 'Outliers' column
fig = px.scatter_geo(unique_cities_outliers_df, 
                     lat='geolocation_lat', 
                     lon='geolocation_lng', 
                     hover_name='geolocation_city',
                     color='Outliers')

# Display the scatter geo plot
fig.show()
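
`get_country_bounding_box` and `set_outliers` come from the local `functions` module, which is not shown here. The following is only a sketch of how a bounding-box check of this kind could work, not the module's actual code. The function name `flag_outliers`, the bounds dictionary layout, the approximate coordinates for Brazil, and the `"Inlier"` label are all assumptions made for illustration; only the `"Outlier"` label and the column names appear in this notebook.

```python
import pandas as pd

# Approximate bounding box for mainland Brazil (assumed values, illustration only).
BRAZIL_BOUNDS = {"min_lat": -33.75, "max_lat": 5.27, "min_lon": -73.99, "max_lon": -34.79}


def flag_outliers(df, bounds, lat_col="geolocation_lat", lon_col="geolocation_lng"):
    """Return a copy of df with an 'Outliers' column set to 'Inlier' or 'Outlier'."""
    lat_ok = df[lat_col].between(bounds["min_lat"], bounds["max_lat"])
    lon_ok = df[lon_col].between(bounds["min_lon"], bounds["max_lon"])
    out = df.copy()
    # Rows with missing coordinates fail both checks and are flagged as outliers.
    out["Outliers"] = (lat_ok & lon_ok).map({True: "Inlier", False: "Outlier"})
    return out


sample = pd.DataFrame({
    "geolocation_city": ["sao paulo", "porto"],
    "geolocation_lat": [-23.55, 41.15],
    "geolocation_lng": [-46.63, -8.61],
})
print(flag_outliers(sample, BRAZIL_BOUNDS)["Outliers"].tolist())  # ['Inlier', 'Outlier']
```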

# Step 3

In [4]:
# Step 1: Replace latitude and longitude values with 'None' for rows marked as 'Outlier'
df.loc[df["Outliers"] == "Outlier", ["geolocation_lat", "geolocation_lng"]] = None

# Step 2: Initialize SimpleImputer (default strategy is 'mean')
imputer = SimpleImputer()

# Step 3: Impute missing values based on city and zip code
# Create a set of unique combinations of city and zip code for rows marked as 'Outlier'
outlier_rows = df[df["Outliers"] == "Outlier"]
filter_values = set(zip(outlier_rows.geolocation_city,
                        outlier_rows.geolocation_zip_code_prefix))

# Loop over each unique combination and apply imputation
for city, zip_code in filter_values:
    condition = (df.geolocation_zip_code_prefix == zip_code) & (df.geolocation_city == city)
    if len(df[condition]) > 1:
        df = impute_values(df, condition, imputer)

# Step 4: Impute missing values based solely on city
df = set_outliers(df, country_bounds)
filter_values = set(df[df["Outliers"] == "Outlier"].geolocation_city)
for city in filter_values:
    condition = (df.geolocation_city == city)
    if len(df[condition]) > 1:
        df = impute_values(df, condition, imputer)

# Step 5: Impute missing values based solely on state
df = set_outliers(df, country_bounds)
filter_values = set(df[df["Outliers"] == "Outlier"].geolocation_state)
for state in filter_values:
    condition = (df.geolocation_state == state)
    if len(df[condition]) > 1:
        df = impute_values(df, condition, imputer)

# Step 6: Re-flag outliers after the last imputation pass so the 'Outliers' column reflects the imputed coordinates
df = set_outliers(df, country_bounds)

# Step 7: Create and display the scatter geo plot
unique_cities_outliers_df = df.drop_duplicates(subset=["geolocation_city", "Outliers"])
fig = px.scatter_geo(unique_cities_outliers_df, 
                     lat='geolocation_lat', 
                     lon='geolocation_lng', 
                     hover_name='geolocation_city',
                     color='Outliers')
fig.show()



Skipping features without any observed values: [0]. At least one non-missing value is needed for imputation with strategy='mean'.
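
The warning above is what scikit-learn's `SimpleImputer` emits when a column it receives contains no observed values at all, so no mean can be computed. That happens whenever every row in a group has had its coordinates blanked, which is why the cell falls back from (city, zip prefix) to city and then to state. The same cascade can be sketched with plain pandas group means. This is an alternative illustration on made-up rows, not the notebook's `impute_values` helper, and `fill_from_group_mean` is a hypothetical name.

```python
import numpy as np
import pandas as pd


def fill_from_group_mean(df, group_cols, value_cols):
    """Fill NaNs in value_cols with the mean of those columns within each group."""
    out = df.copy()
    group_means = out.groupby(group_cols)[value_cols].transform("mean")
    # Groups with no observed values have a NaN mean, so their rows stay missing.
    out[value_cols] = out[value_cols].fillna(group_means)
    return out


coords = ["geolocation_lat", "geolocation_lng"]
toy = pd.DataFrame({
    "geolocation_state": ["SP", "SP", "SP", "SP"],
    "geolocation_city": ["sao paulo", "sao paulo", "sao paulo", "campinas"],
    "geolocation_zip_code_prefix": [1003, 1003, 1004, 13010],
    "geolocation_lat": [-23.50, np.nan, np.nan, np.nan],
    "geolocation_lng": [-46.60, np.nan, np.nan, np.nan],
})

# Narrowest grouping first, then progressively wider fallbacks.
for keys in (["geolocation_city", "geolocation_zip_code_prefix"],
             ["geolocation_city"],
             ["geolocation_state"]):
    toy = fill_from_group_mean(toy, keys, coords)

print(toy[coords].isna().sum().sum())  # 0
```

Note that the widest fallback assigns the state mean to a city with no valid coordinates of its own, so the imputed point can land far from the real location.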



# Step 4

In [5]:
# Remove accents from city names for standardization
dfs["olist_sellers_dataset"]["seller_city"] = dfs["olist_sellers_dataset"]["seller_city"].apply(unidecode)
dfs["olist_customers_dataset"]["customer_city"] = dfs["olist_customers_dataset"]["customer_city"].apply(unidecode)

In [6]:
# Average the coordinates per (state, city, zip prefix), then keep one row per (city, zip prefix)
df = (
    df.drop(columns="Outliers")
    .groupby(by=["geolocation_state", "geolocation_city", "geolocation_zip_code_prefix"])
    .mean()
    .reset_index()
    .drop_duplicates(subset=["geolocation_city", "geolocation_zip_code_prefix"])
)

In [7]:

df

| | geolocation_state | geolocation_city | geolocation_zip_code_prefix | geolocation_lat | geolocation_lng |
|---|---|---|---|---|---|
| 0 | AC | acrelandia | 69945 | -10.059374 | -67.037486 |
| 1 | AC | assis brasil | 69935 | -10.935106 | -69.557831 |
| 2 | AC | brasileia | 69932 | -11.005941 | -68.749175 |
| 3 | AC | bujari | 69926 | -9.798651 | -68.004247 |
| 4 | AC | campinas | 69929 | -22.861615 | -47.064303 |
| ... | ... | ... | ... | ... | ... |
| 19613 | TO | tocantinia | 77640 | -9.567553 | -48.375380 |
| 19614 | TO | tocantinopolis | 77900 | -6.325758 | -47.430900 |
| 19615 | TO | tupirama | 77704 | -8.971529 | -48.188261 |
| 19616 | TO | wanderlandia | 77860 | -6.850639 | -47.965304 |


In [8]:
#df.drop(columns="Outliers").groupby(by = ["geolocation_state", "geolocation_city", "geolocation_zip_code_prefix"]).mean().reset_index().drop_duplicates(subset = "geolocation_zip_code_prefix")

In [9]:
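# One row per customer zip prefix: each column keeps its single distinct value, or becomes a list when there are several
olist_customers_df = dfs["olist_customers_dataset"].groupby(by="customer_zip_code_prefix").agg(lambda x: list(set(x)) if len(set(x)) > 1 else list(set(x))[0]).reset_index()
olist_customers_df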

| | customer_zip_code_prefix | customer_id | customer_unique_id | customer_city | customer_state |
|---|---|---|---|---|---|
| 0 | 1003 | 7ae2a9337aa4bc799723511faa1d6830 | 0c1a20644f0dc126c3eaff8dbc1bd12c | sao paulo | SP |
| 1 | 1004 | [a09edf8c1e842e94805a206b3d73eed5, ee9b73e88af... | [095e7c124c5c1ccb1eb9f731152eae6a, 968f6d2f674... | sao paulo | SP |
| 2 | 1005 | [6e4eb34e7f4d526a82726712aa17c02b, 6ec2b468281... | [ded4351942c7fc292b88e5b090af2b46, 84a7776f914... | sao paulo | SP |
| 3 | 1006 | [fe9db57b1fe84352125989acb12d3c68, 467bcdf6e97... | [0968690d0565e9870ff423120a1051e8, d2d4ae284fb... | sao paulo | SP |
| 4 | 1007 | [b7d357b5d22f91e4ee177ed8ae73e5d8, 6099b4a3bbc... | [29b909cd71ce43caa89ebfd289dd01e2, 3d0c8c998a3... | sao paulo | SP |
| ... | ... | ... | ... | ... | ... |
| 14989 | 99960 | [158af3ad0742719d373724b762396918, 2c07700e162... | [344abd5603ff5310fdd0aa55be719873, e96373fff96... | charrua | RS |
| 14990 | 99965 | [235702411e5214f0e4a0761bf5ce9e99, f93bb6c70a7... | [ee0a41f3ec008b1459efbd99c775e6ea, 9fc089b0b6a... | agua santa | RS |
| 14991 | 99970 | 3ab8bc00f8740d54afc4c771fb6c7f69 | 0528a0a940c7116ccb48fdbb8e80a8ff | ciriaco | RS |
| 14992 | 99980 | [657ba09c6edfbbc09f6054f541ec1f90, 964b34423c8... | [e49eafd7e69d43b8d86f6b5590fafd02, 3dbb390afed... | david canabarro | RS |
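
The aggregation used for this table leaves each column with mixed cell types: a scalar when the zip prefix has one distinct value, a list when it has several. A small sketch on made-up rows shows the effect. The helper name `collapse` and the toy IDs are hypothetical, and unlike the lambda in the cell above this version sorts the values so the list order is deterministic.

```python
import pandas as pd


def collapse(values):
    """Return the single distinct value, or a sorted list when there are several."""
    distinct = sorted(set(values))
    return distinct if len(distinct) > 1 else distinct[0]


toy = pd.DataFrame({
    "customer_zip_code_prefix": [1003, 1004, 1004],
    "customer_id": ["a", "b", "c"],
    "customer_city": ["sao paulo", "sao paulo", "sao paulo"],
})

collapsed = toy.groupby("customer_zip_code_prefix").agg(collapse).reset_index()
print(collapsed["customer_id"].tolist())    # ['a', ['b', 'c']]
print(collapsed["customer_city"].tolist())  # ['sao paulo', 'sao paulo']
```

Any later code that reads these columns has to handle both cases, which is why the counting cells at the end of the notebook test each value with `isinstance(val, list)`.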


In [10]:
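# Same aggregation for sellers: one row per seller zip prefix, with lists where a prefix has several distinct values
olist_sellers_df = dfs["olist_sellers_dataset"].groupby(by="seller_zip_code_prefix").agg(lambda x: list(set(x)) if len(set(x)) > 1 else list(set(x))[0]).reset_index()
olist_sellers_df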

| | seller_zip_code_prefix | seller_id | seller_city | seller_state |
|---|---|---|---|---|
| 0 | 1001 | 8602a61d680a10a82cceeeda0d99ea3d | sao paulo | SP |
| 1 | 1021 | [e0487761face83d64fcada2408959a36, dd55f1bb788... | sao paulo | SP |
| 2 | 1022 | 09bad886111255c5b5030314fc7f1a4a | sao paulo | SP |
| 3 | 1023 | [82921991ff5b557b045605b8bbf08d49, f049a72cf58... | sao paulo | SP |
| 4 | 1026 | [c1dde11f12d05c478f5de2d7319ad3b2, c84592044b1... | sao paulo | SP |
| ... | ... | ... | ... | ... |
| 2241 | 99300 | f524ad65d7e0f1daab730ef2d2e86196 | soledade | RS |
| 2242 | 99500 | [447d377bdb757058acb569025ee18a93, b1a81260566... | carazinho | RS |
| 2243 | 99670 | 4fae87d32467e18eb46e4a76a0a0b9ce | ronda alta | RS |
| 2244 | 99700 | 968ee78631915a63fef426d6733d7422 | erechim | RS |


In [11]:
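# Attach seller and customer information to each geolocation zip prefix
# Left joins keep every geolocation row, including prefixes with no seller or customer
df = df.merge(olist_sellers_df, 
              how = "left", 
              left_on = "geolocation_zip_code_prefix", 
              right_on = "seller_zip_code_prefix"
              ).merge(olist_customers_df,
                      how = "left", 
                      left_on = "geolocation_zip_code_prefix", 
                      right_on = "customer_zip_code_prefix")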

In [12]:
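# Flag which parties are present at each zip prefix
c1 = (df["customer_id"].notnull()) & (df["seller_id"].notnull())  # customers and sellers
c2 = (df["customer_id"].notnull()) & (df["seller_id"].isnull())   # customers only
c3 = (df["customer_id"].isnull()) & (df["seller_id"].notnull())   # sellers only

conditions = [c1, c2, c3]
choices = ["Seller & Buyer", "Buyer", "Seller"]
# Rows matching none of the conditions get the string 'None'
df["Locations"] = np.select(conditions, choices, default='None')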
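
Because `np.select` evaluates its conditions in order and keeps the first match, the same labelling can be written with two masks instead of three fully spelled-out conditions. The following is a sketch on made-up rows; the toy IDs are hypothetical.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "customer_id": ["c1", "c2", None, None],
    "seller_id": ["s1", None, "s3", None],
})

has_customer = toy["customer_id"].notnull()
has_seller = toy["seller_id"].notnull()

# Conditions are checked in order, so "both" must come before the single-party cases.
toy["Locations"] = np.select(
    [has_customer & has_seller, has_customer, has_seller],
    ["Seller & Buyer", "Buyer", "Seller"],
    default="None",
)
print(toy["Locations"].tolist())  # ['Seller & Buyer', 'Buyer', 'Seller', 'None']
```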

In [13]:
fig = px.scatter_geo(df, 
                     lat='geolocation_lat', 
                     lon='geolocation_lng', 
                     hover_name='geolocation_city',
                     color='Locations',
                     scope='south america')
fig.show()


In [17]:
df

| | geolocation_state | geolocation_city | geolocation_zip_code_prefix | geolocation_lat | geolocation_lng | seller_zip_code_prefix | seller_id | seller_city | seller_state | customer_zip_code_prefix | customer_id | customer_unique_id | customer_city | customer_state | Locations |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AC | acrelandia | 69945 | -10.059374 | -67.037486 |  |  |  |  |  |  |  |  |  |  |
| 1 | AC | assis brasil | 69935 | -10.935106 | -69.557831 |  |  |  |  |  |  |  |  |  |  |
| 2 | AC | brasileia | 69932 | -11.005941 | -68.749175 |  |  |  |  | 69932.0 | b1161707c5b3711b7cf6213c114c91b2 | 650dd69e8e20391188d727da3599e9a7 | brasileia | AC | Buyer |
| 3 | AC | bujari | 69926 | -9.798651 | -68.004247 |  |  |  |  |  |  |  |  |  |  |
| 4 | AC | campinas | 69929 | -22.861615 | -47.064303 |  |  |  |  |  |  |  |  |  |  |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 19608 | TO | tocantinia | 77640 | -9.567553 | -48.375380 |  |  |  |  |  |  |  |  |  |  |
| 19609 | TO | tocantinopolis | 77900 | -6.325758 | -47.430900 |  |  |  |  | 77900.0 | [6427d03ab887d0eaa03271f68550c277, 57d7a0e9724... | [f9dbe8ef6911fa7708aaf13d1be3ad75, d7241ab9272... | tocantinopolis | TO | Buyer |
| 19610 | TO | tupirama | 77704 | -8.971529 | -48.188261 |  |  |  |  |  |  |  |  |  |  |
| 19611 | TO | wanderlandia | 77860 | -6.850639 | -47.965304 |  |  |  |  |  |  |  |  |  |  |


In [15]:
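# Collect every customer_unique_id that survived the merge
# A cell holds a single ID, a list of IDs, or NaN (no customer at that zip prefix)
res = set()
for val in df["customer_unique_id"].values:
    if isinstance(val, list):
        res.update(val)
    elif pd.notna(val):
        # Skip NaN explicitly instead of slicing the set, whose order is arbitrary
        res.add(val)

# Unique IDs after the merge vs. unique IDs in the source table
print(len(res))
print(len(dfs["olist_customers_dataset"]["customer_unique_id"].unique()))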

95828
96096


In [16]:
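# Collect every seller_id that survived the merge
# A cell holds a single ID, a list of IDs, or NaN (no seller at that zip prefix)
res = set()
for val in df["seller_id"].values:
    if isinstance(val, list):
        res.update(val)
    elif pd.notna(val):
        # Skip NaN explicitly instead of slicing the set, whose order is arbitrary
        res.add(val)

# Unique IDs after the merge vs. unique IDs in the source table
print(len(res))
print(len(dfs["olist_sellers_dataset"]["seller_id"].unique()))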

3088
3095
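
In both comparisons the merged count is slightly lower than the source count. Since the merge starts from the geolocation table with left joins, a plausible explanation is that some customers and sellers have a zip prefix that never appears in the geolocation data. One way to check this is sketched below on made-up rows; the toy frames and IDs are hypothetical.

```python
import pandas as pd

geo = pd.DataFrame({"geolocation_zip_code_prefix": [1003, 1004]})
customers = pd.DataFrame({
    "customer_zip_code_prefix": [1003, 1004, 99999],
    "customer_unique_id": ["u1", "u2", "u3"],
})

# Customers whose zip prefix is absent from the geolocation table
# cannot survive a left join that starts from the geolocation side.
known_prefix = customers["customer_zip_code_prefix"].isin(geo["geolocation_zip_code_prefix"])
missing = customers[~known_prefix]
print(missing["customer_unique_id"].nunique())  # 1
```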
