## Brainstorming
#### Analysis Questions 
- What was the original size of the chosen data set, and was it expanded to meet the requirement of 5,000 observations (5,000 rows)?
- What method was used to expand the data set to 5,000 rows? (AI was used, based on the trend of the data set.)
- How did this expansion method affect the data set?

#### Data Quality Assessment
- **First**: Remove the NaN (not-a-number) values
    - `.dropna()` drops every row that contains at least one NaN
    - Dropping all rows with NaN values leaves only 44 rows
    - So the NaN values need to be located first with `.isna()`, which returns a boolean mask
- Check data types using `.info()`
    - Displays column names, dtypes, non-null counts, and memory usage
- Check statistical description using `.describe()`
- Use `.sample()` to randomly draw 7,500 rows (observations) from the dataset
- Use `.value_counts()` for frequency count of each category
- Replace each NaN with a random numeric value based on the mean of its column
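
The checklist above can be sketched on a toy frame. This is only an illustration: the column names and values are made up, and the imputation rule (drawing from a normal distribution centred on the column mean, with the column's own standard deviation) is one assumed reading of "random values based on the mean", not a confirmed method.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data; names and values are hypothetical.
df = pd.DataFrame(
    {
        "weight": [7.45, np.nan, 33.0, 2.39, np.nan],
        "width": [27.0, 50.0, np.nan, 55.0, np.nan],
        "status": ["active", None, "faulty", "inactive", "active"],
    }
)

# .isna() returns a boolean mask; summing it counts the NaNs per column.
nan_counts = df.isna().sum()
print(nan_counts)

# How many rows would survive a blanket .dropna()?
rows_left = len(df.dropna())
print(f"Rows left after dropna(): {rows_left} of {len(df)}")

# Assumed rule: fill each numeric NaN with a draw from a normal
# distribution built from that column's mean and standard deviation.
rng = np.random.default_rng(seed=1)
imputed = df.copy()
for col in imputed.select_dtypes(include="number").columns:
    missing = imputed[col].isna()
    draws = rng.normal(imputed[col].mean(), imputed[col].std(), size=missing.sum())
    imputed.loc[missing, col] = draws

# Random sample of rows; random_state makes the draw repeatable.
sample = imputed.sample(n=3, random_state=1)
print(sample)
```

Non-numeric columns such as `status` are left untouched here, since a mean is not defined for them.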

In [1]:
import numpy as np
import pandas as pd

# Load the dataset from a CSV file
Eng_raw = pd.read_csv("data/engineering_messy_dataset.csv")

# Eng_cleaning_qm = Eng_raw.replace('??', np.nan)
# Eng_cleaning_qm => Engineering, cleaning question marks
# Eng_clean = Eng_raw.dropna()

Eng_raw_NaN = Eng_raw.isna()
# pd.to_datetime(Eng_raw['Last_Maintenance'])

# Eng_Repaired_count = Eng_raw.query('Repaired == "Y" or Repaired == "yEs" or Repaired == "No" or Repaired == "N" or Repaired == "Yes" or Repaired == "yes" or Repaired == "no"').count()
# Eng_Repaired_count = Eng_Repaired_count[['Repaired']]

print(f"\n{Eng_raw['Repaired'].unique()}\n")
Eng_raw["Repaired"] = (
    Eng_raw["Repaired"].str.lower().replace({"yes": "Yes", "y": "Yes", "no": "No", "n": "No"})
)
print(f"\n{Eng_raw['Repaired'].unique()}\n")
print(f"\n{Eng_raw['Technician'].unique()}\n")
Eng_raw["Technician"] = Eng_raw["Technician"].replace({"j@mes": "James"})
print(f"\n{Eng_raw['Technician'].unique()}\n")
print(f"\n{Eng_raw['Status'].unique()}\n")
Eng_raw["Status"] = Eng_raw["Status"].replace(
    {"Inactive": "inactive", "ACTIVE": "active", "Active": "active"}
)
print(f"\n{Eng_raw['Status'].unique()}\n")

# Extracting 7500 rows from dataset #
# Eng_cleanSize = Eng_clean.sample(n=7500, random_state=1)
# print(f"\n{Eng_raw.dtypes}\n")
# .info() prints its report directly and returns None, so it is called on its own
Eng_raw.info()
print(f"{Eng_raw.describe()}\n")
print(f"Shape/dimensions of DataFrame: {Eng_raw.shape}\n")
print(f"Number of Elements: {Eng_raw.size}\n")

Eng_raw.head(20)


['Y' 'yEs' 'No' 'N' nan 'Yes' 'yes' '??' 'no']


['Yes' 'No' nan '??']


['Mike' 'Paul' 'Tomi' 'j@mes' 'Ada' 'Ola' 'John' 'Janet' 'Jane' nan
 'James']


['Mike' 'Paul' 'Tomi' 'James' 'Ada' 'Ola' 'John' 'Janet' 'Jane' nan]


['faulty' 'active' 'Inactive' 'inactive' 'ACTIVE' 'Active' nan 'repair']


['faulty' 'active' 'inactive' nan 'repair']
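
The `Repaired` output above still contains `'??'` after normalization. A minimal sketch of one way to handle that, assuming `'??'` carries no information and can be treated as missing; the sample Series below just mirrors the raw labels printed earlier.

```python
import numpy as np
import pandas as pd

# Made-up sample mirroring the raw 'Repaired' labels.
repaired = pd.Series(["Y", "yEs", "No", "N", np.nan, "Yes", "yes", "??", "no"])

# Assumption: "??" is a placeholder for a missing answer, so it maps to NaN.
label_map = {"yes": "Yes", "y": "Yes", "no": "No", "n": "No", "??": np.nan}
cleaned = repaired.str.strip().str.lower().map(label_map)

# dropna=False keeps the NaN bucket visible in the frequency table.
counts = cleaned.value_counts(dropna=False)
print(counts)
```

Unlike `.replace()`, `.map()` turns any label that is not in the dictionary into NaN, so unexpected spellings show up in the NaN count instead of passing through unchanged.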

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Component_ID      50000 non-null  object 
 1   Name              50000 non-null  object 
 2   Material          50000 non-null  object 
 3   Weight(kg)        30025 non-null  object 
 4   Length(cm)        25102 non-null  object 
 5   Width(cm)         16400 non-null  float64
 6   Temp(C)           30031 non-null  object 
 7   Pressure(bar)     24936 non-null  object 
 8   Status            43752 non-null  object 
 9   Last_Maintenance  25203 

Unnamed: 0,Component_ID,Name,Material,Weight(kg),Length(cm),Width(cm),Temp(C),Pressure(bar),Status,Last_Maintenance,Technician,Efficiency(%),Notes,Cost(₦),Batch,Remarks,Production_Date,Fault_Code,Repaired,Comments
0,CMP00001,nozzle,??,,146,,,,faulty,10/18/2024,Mike,??,good cond.,,B3,fine,,F007,Yes,replaced
1,CMP00002,piston,Titanium,??,,,xx,11.6,active,11/28/2024,Paul,??,broken,12157,B1,weak,10/13/2024,F005,Yes,fine
2,CMP00003,nozzle,Aluminium,33,96,,,,inactive,11/19/2024,Tomi,,broken,59545,B6,good,,F007,No,delayed
3,CMP00004,valve,Brass,,,27.0,,,inactive,,James,,rust forming,??,,bad,,F001,Yes,fine
4,CMP00005,rotor,steel,7.45,,,??,11.54,inactive,11/3/2024,James,??,cleaned,,,none,,F004,No,ok
5,CMP00006,pump,??,,??,50.0,??,,inactive,,Mike,??,slight rust,??,B3,ok,,F004,No,fine
6,CMP00007,bolt,??,??,??,55.0,,,active,,Paul,??,slight rust,4178,B3,replace soon,,,,done
7,CMP00008,nozzle,Aluminium,2.39,??,,xx,7.92,inactive,,Paul,,dirty,??,B6,bad,,F006,Yes,replaced
8,CMP00009,bolt,??,68,136,,??,7.59,active,11/12/2024,Ada,??,broken,2370,B1,good,,F008,Yes,
9,CMP00010,bolt,??,,52,,,,inactive,,Ola,??,crack on edge,31507,B5,!,,F007,Yes,ok
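
The `.info()` report lists measurement columns such as `Weight(kg)` and `Temp(C)` as `object`, which is consistent with the `??` and `xx` tokens visible in the preview. A possible next step, sketched on four rows copied from the preview (subset of columns), is to coerce those columns to numbers and parse the maintenance date. Treating `xx` as a missing-value marker is an assumption.

```python
import io
import pandas as pd

# Four rows taken from the preview above, restricted to a few columns.
csv_text = """Component_ID,Weight(kg),Temp(C),Pressure(bar),Last_Maintenance
CMP00002,??,xx,11.6,11/28/2024
CMP00003,33,,,11/19/2024
CMP00005,7.45,??,11.54,11/3/2024
CMP00008,2.39,xx,7.92,
"""
df = pd.read_csv(io.StringIO(csv_text))

# errors="coerce" turns anything that is not a number ("??", "xx") into NaN.
numeric_cols = ["Weight(kg)", "Temp(C)", "Pressure(bar)"]
for col in numeric_cols:
    df[col] = pd.to_numeric(df[col], errors="coerce")

# Dates in the preview look like month/day/year; unparseable or missing
# entries become NaT.
df["Last_Maintenance"] = pd.to_datetime(
    df["Last_Maintenance"], format="%m/%d/%Y", errors="coerce"
)

print(df.dtypes)
print(df)
```

After this step `.describe()` covers the coerced columns as well, and the per-column NaN counts reflect the placeholder tokens too.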
