# Standardization/normalization to-do list:   ✅

- ✅ Check for nulls, fill if possible from name field?
- ✅ Strip extra spaces
- ✅ Convert all case to lower case
- ✅ Eliminate duplicate entries
- ✅ Remove all non-numeric chars from numeric fields
- ✅ Standardize as many fields as possible
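
A minimal sketch of the basic cleanup steps above (whitespace, case, non-numeric characters, duplicates), using only the standard library. The helper names and the two sample rows are made up for illustration; the actual notebook code may look different.

```python
import re

def clean_text(value):
    """Collapse runs of whitespace, trim the ends, and lower-case a string."""
    if value is None:
        return None
    return re.sub(r"\s+", " ", value).strip().lower()

def digits_only(value):
    """Drop every non-digit character. Only meant for integer fields,
    because it would also remove a decimal point."""
    return re.sub(r"\D", "", value)

def drop_duplicates(rows):
    """Keep the first occurrence of each row, preserving the original order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

# Hypothetical rows that only differ in spacing, case, and punctuation.
rows = [
    {"Make": "  Maruti   Suzuki ", "Mileage_Run": "30,420"},
    {"Make": "maruti suzuki", "Mileage_Run": "30420"},
]
cleaned = drop_duplicates([
    {"Make": clean_text(r["Make"]), "Mileage_Run": digits_only(r["Mileage_Run"])}
    for r in rows
])
print(cleaned)
```

Deduplicating after cleaning matters here: the two sample rows only become identical once spacing, case, and separators are normalized.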

1. ✅ Eliminate duplicated information from the table: make, model, year, engine size--basically if the field exists anywhere else in the row, then we have to cut it from the name field. Occasionally there is information in the name field that is not anywhere else in the row, so we can't throw out the field altogether.
2. ✅ Change mileage from str to int
3. ✅ Change num_owners from str eg "1st" to int eg 1
4. ✅ Cut transmission_gears from str eg "5-speed" to int eg 5, and 'cvt' (continuously variable transmission) to 'c'
5. ✅ Change transmission_type from str "Manual" or "Automatic" to "M" or "A"
6. ✅ Cut emission type from "BS V" to just "5"
7. ✅ Change price to int
8. ✅ Change fuel type to just first letter
9. ✅ Try to standardize Engine_Type field, eliminate duplicate data
10. ✅ Split out multiple data points from Engine_Type field (also move out drive train bc this is not engine data). Could split out number of cylinders, turbo boolean, number of valves
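
The field conversions in items 1 to 8 can be sketched as small pure functions. This is an illustration under assumptions, not the code that was actually run: the function names are hypothetical, the single-letter codes are upper-cased to match the "M"/"A" example, and the emission parser assumes the value ends in a Roman numeral.

```python
import re

ROMAN = {"i": 1, "ii": 2, "iii": 3, "iv": 4, "v": 5, "vi": 6}

def to_int(text):
    """Keep only the digits: '30,420' -> 30420, '1st' -> 1, '5-speed' -> 5.
    Returns None when the text contains no digits."""
    digits = re.sub(r"\D", "", text)
    return int(digits) if digits else None

def gears_code(text):
    """'5-speed' -> 5; 'cvt' -> 'c' (assumes 'cvt' is the only non-numeric value)."""
    text = text.strip().lower()
    return "c" if text == "cvt" else to_int(text)

def first_letter(text):
    """'Manual' -> 'M', 'diesel' -> 'D'."""
    return text.strip()[0].upper()

def emission_code(text):
    """'BS V' -> '5'. Raises KeyError on a numeral outside ROMAN."""
    numeral = text.strip().lower().split()[-1]
    return str(ROMAN[numeral])

def strip_known(name, known_values):
    """Remove values that already live in other columns from the name field."""
    for value in known_values:
        name = re.sub(rf"\b{re.escape(value)}\b", " ", name)
    return re.sub(r"\s+", " ", name).strip()

leftover = strip_known(
    "maruti suzuki baleno [2019-2020] alpha diesel",
    ["maruti suzuki", "baleno", "diesel"],
)
print(leftover)                  # whatever is not stored anywhere else in the row
print(to_int("8,83,000"))        # price as an int
print(emission_code("BS V"))
```

Whatever `strip_known` leaves behind is the part of the name that exists nowhere else in the row, which is why the name field cannot simply be dropped.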



# Additional questions I would ask stakeholders/users:

- Are there other possible values we will see in the future that I should account for?
- I've assigned all information from the name field into the relevant other fields. Do you need me to further refine and standardize the engine type field? I did not extract the engine cylinder information.
- Are any of these remaining fields contingent on any other fields? (They don't appear to be)
- We should find another dataset to fill in the drive train information. This is a criterion customers may want to search by, and we have little data on it.
- Should we revert year to an integer (instead of the arbitrary 2017-01-01 we need for Postgres to recognize it as a date)?
- ❌ We have contradictory information for these vehicles. Is there a way to verify the correct information? For now I have left the mileage field as it is. See cell below:



| | Car_Name | Make | Model | Make_Year | Color | Body_Type | Mileage_Run | No_of_Owners | Seating_Capacity | Fuel_Type | Fuel_Tank_Capacity(L) | Engine_Type | CC_Displacement | Transmission | Transmission_Type | Power(BHP) | Torque(Nm) | Mileage(kmpl) | Emission | Price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 322 | maruti suzuki baleno [2019-2020] alpha diesel | maruti suzuki | baleno | 2019 | blue | hatchback | 30,420 | 1st | 5 | diesel | 37 | ddis diesel engine | 1248 | 5-speed | automatic | 74.0 | 190.0 | bs iv | bs v | 8,83,000 |
| 446 | maruti suzuki baleno [2019-2020] alpha diesel | maruti suzuki | baleno | 2019 | blue | hatchback | 37,942 | 1st | 5 | diesel | 37 | ddis diesel engine | 1248 | 5-speed | automatic | 74.0 | 190.0 | bs iv | bs v | 8,47,000 |
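
One way to find rows like these automatically is to flag any row whose Mileage(kmpl) value is not a number. This is a sketch under my reading of the rows above, namely that Mileage(kmpl) holds an emission-standard string ("bs iv") that disagrees with Emission ("bs v"). The `is_number` helper is hypothetical, and only the relevant columns are reproduced.

```python
def is_number(text):
    """True when the text parses as a number once thousands separators are removed."""
    try:
        float(text.replace(",", ""))
        return True
    except ValueError:
        return False

# Only the columns relevant to the check, copied from the two rows above.
rows = [
    {"index": 322, "Mileage(kmpl)": "bs iv", "Emission": "bs v"},
    {"index": 446, "Mileage(kmpl)": "bs iv", "Emission": "bs v"},
]

suspect = [r["index"] for r in rows if not is_number(r["Mileage(kmpl)"])]
print(suspect)
```

The check only reports the rows. Deciding which value is correct still needs an outside source.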

## Additional possibilities for normalization: 

Create an engine table. Some of the characteristics of the car may depend on engine_type, and it could make sense to split those out into another table. I don't yet know enough about cars, and about the Indian car market in particular, to do that, but it is something to consider if we want to get to 3NF.
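
A rough sketch of what that split could look like, assuming (and this is only an assumption) that CC_Displacement, Power(BHP), and Torque(Nm) are determined by Engine_Type. The `split_engines` helper and the `engine_id` surrogate key are hypothetical names.

```python
def split_engines(cars, engine_fields):
    """Move engine_fields into a separate lookup and leave an engine_id behind.

    Returns (cars_without_engine_fields, engines), where engines maps
    engine_id -> engine attributes.
    """
    engine_ids = {}  # attribute tuple -> surrogate id
    engines = {}     # surrogate id -> attribute dict
    slim_cars = []
    for car in cars:
        key = tuple(car[f] for f in engine_fields)
        if key not in engine_ids:
            engine_ids[key] = len(engine_ids) + 1
            engines[engine_ids[key]] = dict(zip(engine_fields, key))
        slim = {k: v for k, v in car.items() if k not in engine_fields}
        slim["engine_id"] = engine_ids[key]
        slim_cars.append(slim)
    return slim_cars, engines

# Values taken from rows 322 and 446; other columns omitted for brevity.
cars = [
    {"Model": "baleno", "Mileage_Run": "30,420", "Engine_Type": "ddis diesel engine",
     "CC_Displacement": 1248, "Power(BHP)": 74.0, "Torque(Nm)": 190.0},
    {"Model": "baleno", "Mileage_Run": "37,942", "Engine_Type": "ddis diesel engine",
     "CC_Displacement": 1248, "Power(BHP)": 74.0, "Torque(Nm)": 190.0},
]
engine_fields = ["Engine_Type", "CC_Displacement", "Power(BHP)", "Torque(Nm)"]
slim_cars, engines = split_engines(cars, engine_fields)
print(engines)
print(slim_cars)
```

Both cars end up pointing at the same engine row, so the engine attributes are stored once instead of once per car.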

# Normalization Checklist:

1. First Normal Form (1NF)

- It only has atomic (indivisible) values. In other words, each cell should contain only one value, not a set of values or empty sets.
- Entries in a column are of the same kind. Every value in a column should have the same type (numeric, text, date, etc.).
- Each column in a table should represent a single attribute of the entity modelled by the table (e.g., a 'car' table might have separate columns for 'make', 'model', 'year', etc.)
- Order in which data is saved does not matter.
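
The first two 1NF points can be spot-checked in code. The sketch below is only a heuristic: it flags columns that hold collections (non-atomic values) or a mix of Python types. The `first_nf_problems` name and the sample rows are made up for illustration and are not taken from the dataset.

```python
def first_nf_problems(rows):
    """Return {column: problem} for columns that look like 1NF violations."""
    problems = {}
    columns = rows[0].keys() if rows else []
    for col in columns:
        values = [r[col] for r in rows if r[col] is not None]
        if any(isinstance(v, (list, tuple, set, dict)) for v in values):
            problems[col] = "non-atomic value"
        elif len({type(v) for v in values}) > 1:
            problems[col] = "mixed types"
    return problems

# Made-up rows: Price mixes int and str, Engine_Type holds lists.
rows = [
    {"Model": "baleno", "Price": 883000, "Engine_Type": ["ddis", "turbo"]},
    {"Model": "baleno", "Price": "8,47,000", "Engine_Type": ["ddis"]},
]
print(first_nf_problems(rows))
```

A string that packs several facts together (like the original Engine_Type text) would pass this check, so it does not replace reading the columns by eye.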

--

2. Second Normal Form (2NF)

- It is in 1NF.
- All non-key columns are fully dependent on the primary key. A non-key column must be functionally dependent on the entire primary key, including every column of a composite key. There should be no partial dependency.
- In other words, if a column is dependent on only part of a multi-part primary key, then the table fails 2NF.

--

3. Third Normal Form (3NF)

- It is in 2NF.
- It has no transitive dependencies. A transitive dependency occurs when a non-key column is dependent on another non-key column, which is dependent on the primary key.
- Every non-key attribute must be functionally dependent on the primary key directly and not through some other non-key attributes.
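
Both the 2NF and 3NF checks come down to asking whether one set of columns determines another column. A small sketch of that test follows, with hypothetical helper names and made-up rows. Note that data can only disprove a dependency. If the check passes, the dependency is merely consistent with the rows seen so far.

```python
from collections import defaultdict

def fd_violations(rows, determinant, dependent):
    """Return determinant values that map to more than one dependent value."""
    seen = defaultdict(set)
    for row in rows:
        key = tuple(row[c] for c in determinant)
        seen[key].add(row[dependent])
    return {k: v for k, v in seen.items() if len(v) > 1}

def holds(rows, determinant, dependent):
    """True when no row contradicts determinant -> dependent."""
    return not fd_violations(rows, determinant, dependent)

# Made-up rows with a surrogate id as the primary key.
rows = [
    {"id": 1, "Engine_Type": "ddis diesel engine", "CC_Displacement": 1248, "Color": "blue"},
    {"id": 2, "Engine_Type": "ddis diesel engine", "CC_Displacement": 1248, "Color": "red"},
]

# A non-key column that appears to determine another non-key column is a
# candidate transitive dependency (a 3NF concern).
print(holds(rows, ["Engine_Type"], "CC_Displacement"))
print(holds(rows, ["Engine_Type"], "Color"))
```

For 2NF the same function applies with `determinant` set to a proper subset of a composite primary key.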






