3. **Data Quality Analysis Report:** Document, in markdown format, the results of the data quality analysis, identifying the problems found (such as missing values or inconsistencies) and detailing the solutions implemented to ensure the reliability of the subsequent analysis.

In [13]:
import pandas as pd

# Load datasets
cash_request_df = pd.read_csv('project_dataset/extract - cash request - data analyst.csv')
fees_df = pd.read_csv('project_dataset/extract - fees - data analyst - .csv')

# Display basic information about the dataset
print("Basic Information extract - cash request - data analyst.csv:")
print(cash_request_df.info())

Basic Information extract - cash request - data analyst.csv:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23970 entries, 0 to 23969
Data columns (total 16 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   id                          23970 non-null  int64  
 1   amount                      23970 non-null  float64
 2   status                      23970 non-null  object 
 3   created_at                  23970 non-null  object 
 4   updated_at                  23970 non-null  object 
 5   user_id                     21867 non-null  float64
 6   moderated_at                16035 non-null  object 
 7   deleted_account_id          2104 non-null   float64
 8   reimbursement_date          23970 non-null  object 
 9   cash_request_received_date  16289 non-null  object 
 10  money_back_date             16543 non-null  object 
 11  transfer_type               23970 non-null  object 
 12  send_at                    

In [14]:
# Display basic information about the dataset
print("Basic Information extract - fees - data analyst - .csv:")
print(fees_df.info())

Basic Information extract - fees - data analyst - .csv:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21061 entries, 0 to 21060
Data columns (total 13 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   id               21061 non-null  int64  
 1   cash_request_id  21057 non-null  float64
 2   type             21061 non-null  object 
 3   status           21061 non-null  object 
 4   category         2196 non-null   object 
 5   total_amount     21061 non-null  float64
 6   reason           21061 non-null  object 
 7   created_at       21061 non-null  object 
 8   updated_at       21061 non-null  object 
 9   paid_at          15531 non-null  object 
 10  from_date        7766 non-null   object 
 11  to_date          7766 non-null   object 
 12  charge_moment    21061 non-null  object 
dtypes: float64(2), int64(1), object(10)
memory usage: 2.1+ MB
None
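The `info()` output above lists every timestamp column as `object`, and `cash_request_id` as `float64` (pandas stores an integer column with missing values as floats by default). The sketch below shows one way these types could be converted. It assumes each timestamp column uses a single format that `pd.to_datetime` can parse; `convert_fee_types` is a hypothetical helper and the sample rows are invented for illustration, not taken from the extract.

```python
import pandas as pd

def convert_fee_types(df, date_columns):
    """Return a copy with date columns parsed and cash_request_id as a nullable integer."""
    out = df.copy()
    for column in date_columns:
        # errors="coerce" turns unparseable strings into NaT instead of raising
        out[column] = pd.to_datetime(out[column], errors="coerce")
    # Nullable Int64 keeps missing ids as <NA> instead of forcing a float column
    out["cash_request_id"] = out["cash_request_id"].astype("Int64")
    return out

# Invented sample rows (assumption), shaped like the fees extract
sample = pd.DataFrame({
    "cash_request_id": [14941.0, None],
    "created_at": ["2020-05-29 11:49:04", "not a date"],
    "paid_at": ["2020-12-17 14:50:07", None],
})
converted = convert_fee_types(sample, ["created_at", "paid_at"])
print(converted.dtypes)
```

Counting the `NaT` values introduced by the coercion, and comparing them with the missing counts before conversion, would reveal any timestamps that failed to parse.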


In [15]:
# Evaluate data quality
def evaluate_data_quality(df, df_name):
    print(f"Evaluating data quality for {df_name}...\n")
    
    # Check for missing values
    missing_values = df.isnull().sum()
    print(f"Missing values in {df_name}:\n{missing_values}\n")
    
    # Check for duplicate rows
    duplicate_rows = df.duplicated().sum()
    print(f"Duplicate rows in {df_name}: {duplicate_rows}\n")
    
    # Check for inconsistencies (example: negative values in columns that should only have positive values)
    for column in df.select_dtypes(include=['number']).columns:
        negative_values = (df[column] < 0).sum()
        if negative_values > 0:
            print(f"Column '{column}' in {df_name} has {negative_values} negative values.\n")
    
    # Summary statistics
    print(f"Summary statistics for {df_name}:\n{df.describe()}\n")
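
Raw missing counts are easier to compare across datasets when expressed as a share of rows. The following is a minimal sketch of a companion helper that returns a table instead of printing; `missing_summary` is a hypothetical name and the sample frame is invented.

```python
import pandas as pd

def missing_summary(df):
    """Return per-column missing counts and percentages, worst columns first."""
    summary = pd.DataFrame({
        "missing": df.isnull().sum(),
        # The mean of a boolean mask is the fraction of missing rows
        "missing_pct": (df.isnull().mean() * 100).round(2),
    })
    return summary.sort_values("missing", ascending=False)

# Invented sample (assumption), only to show the output shape
sample = pd.DataFrame({
    "a": [1, None, 3, None],
    "b": ["x", "y", None, "z"],
    "c": [1, 2, 3, 4],
})
result = missing_summary(sample)
print(result)
```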

In [16]:
# Evaluate data quality for extract - cash request - data analyst dataset
evaluate_data_quality(cash_request_df, 'extract - cash request - data analyst.csv')

Evaluating data quality for extract - cash request - data analyst.csv...

Missing values in extract - cash request - data analyst.csv:
id                                0
amount                            0
status                            0
created_at                        0
updated_at                        0
user_id                        2103
moderated_at                   7935
deleted_account_id            21866
reimbursement_date                0
cash_request_received_date     7681
money_back_date                7427
transfer_type                     0
send_at                        7329
recovery_status               20640
reco_creation                 20640
reco_last_update              20640
dtype: int64

Duplicate rows in extract - cash request - data analyst.csv: 0

Summary statistics for extract - cash request - data analyst.csv:
                 id        amount        user_id  deleted_account_id
count  23970.000000  23970.000000   21867.000000         2104.000000
mean   
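
Two patterns in the missing counts above are worth checking row by row. First, `user_id` has 21867 non-null values and `deleted_account_id` has 2104, which together exceed the 23970 rows by one, so the columns look almost complementary and at least one row must have both populated. Second, `recovery_status`, `reco_creation` and `reco_last_update` share the same missing count (20640), which suggests they are empty on the same rows. A sketch of both checks follows; the helper names are hypothetical and the sample frame is invented.

```python
import pandas as pd

def owner_id_pattern(df):
    """Count rows by which of user_id / deleted_account_id is populated."""
    has_user = df["user_id"].notna()
    has_deleted = df["deleted_account_id"].notna()
    return pd.Series({
        "user_id only": int((has_user & ~has_deleted).sum()),
        "deleted_account_id only": int((~has_user & has_deleted).sum()),
        "both": int((has_user & has_deleted).sum()),
        "neither": int((~has_user & ~has_deleted).sum()),
    })

def missing_together(df, columns):
    """Return True if the given columns are missing on exactly the same rows."""
    mask = df[columns].isna()
    # Compare every column's missing mask with the first column's mask
    return bool(mask.eq(mask.iloc[:, 0], axis=0).all().all())

# Invented sample (assumption) covering all four ownership cases
sample = pd.DataFrame({
    "user_id": [1.0, None, 3.0, None],
    "deleted_account_id": [None, 9.0, 7.0, None],
    "recovery_status": [None, "pending", None, None],
    "reco_creation": [None, "2020-06-01", None, None],
})
pattern = owner_id_pattern(sample)
print(pattern)
print(missing_together(sample, ["recovery_status", "reco_creation"]))
```

If the recovery columns are indeed missing together, the gaps are more likely structural (no recovery incident for that request) than a data collection error, and imputing them would be misleading.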

In [17]:
# Evaluate data quality for extract - fees - data analyst dataset
evaluate_data_quality(fees_df, 'extract - fees - data analyst - .csv')

Evaluating data quality for extract - fees - data analyst - .csv...

Missing values in extract - fees - data analyst - .csv:
id                     0
cash_request_id        4
type                   0
status                 0
category           18865
total_amount           0
reason                 0
created_at             0
updated_at             0
paid_at             5530
from_date          13295
to_date            13295
charge_moment          0
dtype: int64

Duplicate rows in extract - fees - data analyst - .csv: 0

Summary statistics for extract - fees - data analyst - .csv:
                 id  cash_request_id  total_amount
count  21061.000000     21057.000000  21061.000000
mean   10645.355111     16318.449162      5.000237
std     6099.315256      6656.149949      0.034453
min        1.000000      1456.000000      5.000000
25%     5385.000000     11745.000000      5.000000
50%    10652.000000     17160.000000      5.000000
75%    15925.000000     21796.000000      5.000000
max    2
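
The fees extract has 4 rows without a `cash_request_id`. Beyond those, it may be useful to confirm that every populated `cash_request_id` points at a row in the cash request extract. The sketch below assumes that `cash_request_id` in the fees data refers to `id` in the cash request data; the column names suggest this, but the notebook does not state it. `find_orphan_fees` is a hypothetical helper and the sample frames are invented.

```python
import pandas as pd

def find_orphan_fees(fees, cash_requests):
    """Return fee rows whose cash_request_id is missing or matches no cash request id."""
    # isin() yields False for NaN, so rows without an id count as orphans too
    known = fees["cash_request_id"].isin(cash_requests["id"])
    return fees[~known]

# Invented samples (assumption): one valid reference, one unknown id, one missing id
cash_sample = pd.DataFrame({"id": [1, 2, 3]})
fees_sample = pd.DataFrame({
    "id": [10, 11, 12],
    "cash_request_id": [1.0, 5.0, None],
})
orphans = find_orphan_fees(fees_sample, cash_sample)
print(orphans)
```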

In [None]:

# Data cleaning and preprocessing
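def clean_data(df, id_columns=()):
    # Work on a copy so the raw DataFrame stays available for comparison
    df = df.copy()

    # Numeric columns that hold measurements; identifier columns are excluded
    # because a mean or a clipped value is meaningless for a key
    value_columns = [
        column for column in df.select_dtypes(include=['number']).columns
        if column not in id_columns
    ]

    for column in value_columns:
        # Fill missing values with the column mean (example strategy)
        filled = df[column].fillna(df[column].mean())
        # Replace negative values with zero (example strategy)
        df[column] = filled.clip(lower=0)

    # Drop duplicate rows
    return df.drop_duplicates()

# Clean both datasets, leaving identifier columns untouched
cleaned_cash_request_df = clean_data(cash_request_df, id_columns=['id', 'user_id', 'deleted_account_id'])
cleaned_fees_df = clean_data(fees_df, id_columns=['id', 'cash_request_id'])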

# Save cleaned datasets
cleaned_cash_request_df.to_csv('project_dataset/cleaned_extract - cash request - data analyst.csv', index=False)
cleaned_fees_df.to_csv('project_dataset/cleaned_extract - fees - data analyst - .csv', index=False)

print("Data cleaning and preprocessing completed.")
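
Since the task asks for the findings to be documented in markdown, the missing-value counts can be rendered as a markdown table and pasted into a markdown cell. This is a minimal sketch; `missing_values_markdown` is a hypothetical helper, the sample frame is invented, and the table layout is one possible choice. It is meant to be run on the frames as loaded, before any imputation.

```python
import pandas as pd

def missing_values_markdown(df, title):
    """Render the columns that have missing values as a markdown table."""
    missing = df.isnull().sum()
    missing = missing[missing > 0].sort_values(ascending=False)
    lines = [
        f"**{title}**",
        "",
        "| Column | Missing | % of rows |",
        "| --- | ---: | ---: |",
    ]
    for column, count in missing.items():
        lines.append(f"| {column} | {count} | {count / len(df) * 100:.1f} |")
    if missing.empty:
        lines.append("| (none) | 0 | 0.0 |")
    return "\n".join(lines)

# Invented sample (assumption), only to show the rendered table
sample = pd.DataFrame({"a": [1, None, 3, None], "b": [1, 2, 3, 4]})
report = missing_values_markdown(sample, "Missing values: sample")
print(report)
```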