## LET'S TRY THIS DATASET I FOUND ON GIT:


In [29]:
import pandas as pd
import logging
import numpy as np
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

route = '../data/SBS_Certified_Business_List.csv'  

try:
    df = pd.read_csv(route)
    first_row = df.iloc[0].to_dict()
    logging.info("First row data:")
    for columna, valor in first_row.items():
        logging.info(f"{columna}: {valor}")
except Exception as e:
    logging.error("Error loading csv: %s", e)
else:
    # Report the shape only when loading succeeded; otherwise df would be undefined here
    print(df.shape)



2024-04-02 21:55:46,415 - INFO - First row data:
2024-04-02 21:55:46,416 - INFO - Account_Number: 311147
2024-04-02 21:55:46,418 - INFO - Vendor_Formal_Name: #1Pho Inc
2024-04-02 21:55:46,421 - INFO - Vendor_DBA: Zenyai
2024-04-02 21:55:46,423 - INFO - First_Name: Albert
2024-04-02 21:55:46,424 - INFO - Last_Name: Jethanamest
2024-04-02 21:55:46,426 - INFO - telephone: 6463879761
2024-04-02 21:55:46,427 - INFO - Business_Description: Zenyai Viet Cajun & Pho Restaurant is dedicated to offering real Vietnamese flavor through distinct seafood boils and pho noodle dishes.
2024-04-02 21:55:46,429 - INFO - Certification: MBE
2024-04-02 21:55:46,432 - INFO - Certification_Renewal_Date: 10/31/2025
2024-04-02 21:55:46,433 - INFO - Ethnicity: ASIAN
2024-04-02 21:55:46,436 - INFO - Address_Line_1: 208 Grand Street
2024-04-02 21:55:46,439 - INFO - Address_Line_2: nan
2024-04-02 21:55:46,443 - INFO - City: Brooklyn
2024-04-02 21:55:46,444 - INFO - State: NY
2024-04-02 21:55:46,447 - INFO - Postcode

(11430, 56)


LET'S ANALYZE THE DATASET TO UNDERSTAND IT BETTER:

First, note that we treat this project as a classification problem in which we classify each digit of the NAICS code one by one.

First digit of the NAICS code: Together with the second digit, identifies the economic sector.
Second digit: Completes the two-digit sector code.
Third digit: Identifies the subsector within the sector.
Fourth digit: Identifies the industry group.
Fifth digit: Identifies the NAICS industry, the most detailed level shared by the United States, Canada, and Mexico.
Sixth digit: Identifies the national industry, a country-specific level of detail that can differ between the three countries.


This approach results in a separate classification problem for each of the six digits of the NAICS code.
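The per-digit setup described above can be sketched as follows. This is a minimal illustration, not the notebook's actual preprocessing: the helper name `split_naics_digits` is hypothetical, and the handling of float-like values such as `541990.0` is an assumption about how the CSV column might be parsed.

```python
def split_naics_digits(code):
    """Return the six digits of a NAICS code as a list of ints, or None if malformed."""
    text = str(code).strip()
    # Assumption: a numeric column with gaps may be read as float, e.g. 541990.0
    if text.endswith(".0"):
        text = text[:-2]
    if len(text) != 6 or not (text.isascii() and text.isdigit()):
        return None
    return [int(ch) for ch in text]


# Toy inputs; the valid codes are taken from the log output in this notebook
codes = [541990, "238990", 611710.0, "bad"]
targets = [split_naics_digits(c) for c in codes]

# Each position becomes the label of one classifier: position 0 for the first digit, and so on
first_digit_labels = [t[0] for t in targets if t is not None]
```

Malformed codes come back as `None`, so they can be filtered out before building the six label columns.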

Before addressing any issues in the dataset, let's see how the businesses in it are distributed:

In [41]:
try:
    unique_naics_count = df['ID6_digit_NAICS_code'].nunique()
    logging.info(f"Number of unique NAICS codes: {unique_naics_count}")

    # value_counts() sorts by frequency, so head/tail give the most and least common codes
    naics_counts = df['ID6_digit_NAICS_code'].value_counts()
    top_5_repeated_naics = naics_counts.head(5)
    top_5_less_repeated_naics = naics_counts.tail(5)
    logging.info("Top 5 most repeated NAICS codes and their counts:")
    for code, count in top_5_repeated_naics.items():
        logging.info(f"NAICS code {code} is repeated {count} times.")
    for code, count in top_5_less_repeated_naics.items():
        logging.info(f"NAICS code {code} is repeated {count} time.")

except KeyError as e:
    logging.error(f"Column not found in DataFrame: {e}")
except Exception as e:
    logging.error(f"An unexpected error occurred during the analysis: {e}")

2024-04-03 09:55:17,203 - INFO - Number of unique NAICS codes: 588
2024-04-03 09:55:17,208 - INFO - Top 5 most repeated NAICS codes and their counts:
2024-04-03 09:55:17,209 - INFO - NAICS code 541990 is repeated 534 times.
2024-04-03 09:55:17,210 - INFO - NAICS code 541310 is repeated 363 times.
2024-04-03 09:55:17,211 - INFO - NAICS code 541330 is repeated 348 times.
2024-04-03 09:55:17,211 - INFO - NAICS code 238990 is repeated 339 times.
2024-04-03 09:55:17,212 - INFO - NAICS code 611710 is repeated 337 times.
2024-04-03 09:55:17,213 - INFO - NAICS code 325510 is repeated 1 time.
2024-04-03 09:55:17,213 - INFO - NAICS code 524114 is repeated 1 time.
2024-04-03 09:55:17,214 - INFO - NAICS code 812332 is repeated 1 time.
2024-04-03 09:55:17,215 - INFO - NAICS code 522320 is repeated 1 time.
2024-04-03 09:55:17,216 - INFO - NAICS code 455110 is repeated 1 time.


The 2017 revision of the U.S. NAICS defines 1,057 distinct 6-digit codes, and the 2022 revision defines 1,012. Our dataset contains only 588 distinct NAICS codes, a little over half of the full classification.
Several factors may explain this narrower range. The timing of data collection matters, since the codes present reflect economic conditions and industry relevance at that moment. A focus on particular industry subsectors or objectives may limit the diversity of codes captured. Geographic location could also play a role, as regional economic activity dictates which industries are present.

More precisely, a handful of NAICS codes dominate: 541990 alone appears 534 times and the five most frequent codes account for 1,921 of the 11,430 rows (about 17%), while codes such as 325510 appear only once. This matters because it suggests the dataset is skewed toward certain industries or sectors. The skew could reflect a concentration of economic activity in those sectors or a particular focus of the data collection effort. Understanding it can provide insight into industry dominance, regional economic strengths, or the objectives that guided the data compilation.
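To put numbers on that skew, a small summary like the one below may help. It is only a sketch: `imbalance_summary` is a hypothetical helper, the toy labels are made up for illustration, and the chosen metrics (share of the top classes, number of singleton classes, max/min ratio) are one reasonable option among several.

```python
from collections import Counter


def imbalance_summary(labels, top_n=5):
    """Summarise how unevenly the class labels are distributed."""
    counts = Counter(labels)
    if not counts:
        raise ValueError("labels must not be empty")
    total = sum(counts.values())
    top = counts.most_common(top_n)
    return {
        "n_classes": len(counts),
        # Fraction of all rows covered by the top_n most frequent classes
        "top_share": sum(c for _, c in top) / total,
        # Classes seen exactly once cannot be split into train and test sets
        "singletons": sum(1 for c in counts.values() if c == 1),
        "imbalance_ratio": max(counts.values()) / min(counts.values()),
    }


# Made-up toy labels, not the real dataset
labels = ["541990"] * 6 + ["541310"] * 3 + ["325510"]
summary = imbalance_summary(labels, top_n=1)
```

In this notebook the same function could presumably be called on `df['ID6_digit_NAICS_code'].dropna()`, assuming that column holds the labels.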

In [32]:
nan_count = df.isna().sum()
logging.info("missing values per feature:\n%s", nan_count)

2024-04-02 22:12:36,254 - INFO - missing values per feature:
Account_Number                                   0
Vendor_Formal_Name                               0
Vendor_DBA                                    9937
First_Name                                      23
Last_Name                                       24
telephone                                       12
Business_Description                             7
Certification                                    0
Certification_Renewal_Date                    3164
Ethnicity                                        3
Address_Line_1                                 116
Address_Line_2                                6984
City                                             0
State                                            0
Postcode                                         1
Mailing_Address_Line_1                           2
Mailing_Address_Line_2                        6888
Mailing_City                                     2
Mailing_State        

We can draw a few conclusions from this output.
Errors in data collection are always a possible cause of NaNs, so I will omit that explanation below:
    - Vendor_DBA has many NaNs (9,937 of 11,430 rows). Why? Perhaps the field is left blank when the DBA is the same as the formal name. The following cell shows that even among the rows where Vendor_DBA has a value, 51 coincide with Vendor_Formal_Name.

First, let's drop the features that are entirely NaN:

In [37]:
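# Drop columns in which every value is missing
df = df.dropna(axis=1, how='all')
# Count rows where the DBA is present and identical to the formal name
matching_rows = df[df['Vendor_DBA'].notna() & (df['Vendor_DBA'] == df['Vendor_Formal_Name'])].shape[0]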
logging.info(f"Number of rows where 'Vendor_DBA' is not NaN and matches 'Vendor_Formal_Name': {matching_rows}")




2024-04-03 09:32:35,293 - INFO - Number of rows where 'Vendor_DBA' is not NaN and matches 'Vendor_Formal_Name': 51
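The comparison above counts only exact string matches, so names that differ in case, punctuation, or spacing are missed. A looser check might look like the sketch below. The helpers `normalize_name` and `same_business_name` are hypothetical and not part of the notebook, and the normalisation rules are an assumption about which differences should be ignored. The missing-value test covers `None` and float NaN only.

```python
import re


def normalize_name(name):
    """Casefold, drop punctuation, and collapse whitespace; None for missing values."""
    if name is None or name != name:  # NaN is the only value not equal to itself
        return None
    text = re.sub(r"[^\w\s]", "", str(name).casefold())
    return " ".join(text.split())


def same_business_name(formal, dba):
    """True when both names are present and equal after normalisation."""
    a, b = normalize_name(formal), normalize_name(dba)
    return bool(a) and a == b


# The first pair differs only in case, spacing, and punctuation
loose_match = same_business_name("#1Pho Inc", "#1pho  inc.")
no_match = same_business_name("#1Pho Inc", "Zenyai")
missing = same_business_name("#1Pho Inc", float("nan"))
```

Applied row by row, this would likely report more than the 51 exact matches, though how many more depends on the data.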


Number of distinct NAICS codes: 588
Maximum number of times a NAICS code is repeated: 534
