# I will be ingesting a CSV into PostgreSQL and normalizing it.
# Or will I normalize it partly in Python and then ingest and do finishing touches?
First, creating and activating our venv:

# Normalization to-do list:   ✅

- ✅ Check for nulls, fill if possible from name field?
- ✅ Strip extra spaces
- ✅ Eliminate duplicate entries


1. Eliminate duplicated information from the table: make, model, year, engine size--basically if the field exists anywhere else in the row, then we have to cut it from the name field. Occasionally there is information in the name field that is not anywhere else in the row, so we can't throw out the field altogether.
2. Change mileage from str to int
3. Change num_owners from str eg "1st" to int eg 1
4. Cut transmission_gears from str eg "5-Speed" to int eg 5
5. Change transmission_type from str "Manual" or "Automatic" to "M" or "A"
6. Cut emission type from "BS V" to just "V"
7. Change price to int. Are there any prices that do NOT end in "000"? If not, just cut the trailing "000" and the price field can be in thousands
8. Change fuel type to just first letter
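
A minimal sketch of how items 2 through 8 might look as plain Python helpers, assuming the value shapes seen in the sample rows (`'44,611'`, `'1st'`, `'5-Speed'`, `'Manual'`, `'BS IV'`, `'6,57,000'`). All function names here are hypothetical, and none of this has been run against the full file.

```python
import re

def to_int(text):
    """Drop every non-digit: '44,611' -> 44611, '6,57,000' -> 657000.
    Only meant for whole-number fields; a decimal like '21.66' would be mangled."""
    digits = re.sub(r"\D", "", text)
    return int(digits) if digits else None

def leading_int(text):
    """Return the integer at the start of values like '1st' or '5-Speed'."""
    match = re.match(r"\s*(\d+)", text)
    return int(match.group(1)) if match else None

def first_letter(text):
    """'Manual' -> 'M', 'Automatic' -> 'A', 'petrol' -> 'P'."""
    text = text.strip()
    return text[0].upper() if text else None

def emission_stage(text):
    """'BS IV' -> 'IV'; values without the 'BS' prefix pass through unchanged."""
    return re.sub(r"^\s*BS\s*", "", text, flags=re.IGNORECASE).strip()

def all_end_in_thousands(prices):
    """True if every price string ends in '000' once separators are removed."""
    return all(re.sub(r"\D", "", p).endswith("000") for p in prices)
```

Each helper could then be applied per column, for example `df["Mileage_Run"].map(to_int)`, and `all_end_in_thousands(df["Price"])` would answer the question in item 7 before any zeros are cut.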



In [None]:
python3 -m venv venv
. venv/bin/activate

Let's ingest to a df

In [None]:
pip install pandas

In [32]:
import pandas as pd

df = pd.read_csv("/Users/bfaris96/Desktop/turing-proj/cars_db/FINAL_SPINNY_900.csv")

In [33]:
df.shape

(976, 20)

Checking for any nulls:

In [35]:
print(df[df.isnull().any(axis=1)])

Empty DataFrame
Columns: [Car_Name, Make, Model, Make_Year, Color, Body_Type, Mileage_Run, No_of_Owners, Seating_Capacity, Fuel_Type, Fuel_Tank_Capacity(L), Engine_Type, CC_Displacement, Transmission, Transmission_Type, Power(BHP), Torque(Nm), Mileage(kmpl), Emission, Price]
Index: []


Stripping extra whitespace:

In [36]:
df = df.applymap(lambda x: ' '.join(x.split()) if isinstance(x, str) else x)
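
To make it clearer what that lambda does to a single cell, here is the same logic as a named function (the name `collapse_whitespace` is just for illustration). It trims both ends, squeezes any run of spaces or tabs down to one space, and leaves non-string values untouched.

```python
def collapse_whitespace(value):
    """Trim the ends and squeeze internal whitespace runs to a single space.
    Non-string values (floats, None, ...) are returned unchanged."""
    if isinstance(value, str):
        return " ".join(value.split())
    return value

print(repr(collapse_whitespace("  Hyundai   i20 \tActive ")))  # 'Hyundai i20 Active'
print(collapse_whitespace(82.0))                               # 82.0
```

On pandas 2.1 and later, `applymap` emits a deprecation warning; `df.map(collapse_whitespace)` is the replacement there.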

Eliminating duplicate entries:

In [37]:
df = df.drop_duplicates()

In [38]:
df.shape

(914, 20)

I'm going to use psycopg2 for this with raw SQL, no ORM. I've already created the database in pgAdmin. It is called car_db.

pip install psycopg2
pip install python-dotenv

In [27]:
import psycopg2
import os
from dotenv import load_dotenv

load_dotenv()

DB_USER = os.getenv("DB_USER")
DB_PASSWORD = os.getenv("DB_PASSWORD")

conn = psycopg2.connect(f"host=localhost dbname=car_db user={DB_USER} password={DB_PASSWORD}")
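
One caveat with building the connection string through an f-string: a password containing a space or a quote would break the `key=value` format. A hedged alternative is to collect the settings as keyword arguments, which `psycopg2.connect()` also accepts. In this sketch the helper name `connection_params` and the `DB_HOST` / `DB_NAME` variables are assumptions; only `DB_USER` and `DB_PASSWORD` come from the `.env` used above.

```python
import os

def connection_params(env=os.environ):
    """Collect connection settings as keyword arguments for psycopg2.connect()."""
    return {
        "host": env.get("DB_HOST", "localhost"),  # DB_HOST is a hypothetical variable
        "dbname": env.get("DB_NAME", "car_db"),   # DB_NAME is a hypothetical variable
        "user": env.get("DB_USER"),
        "password": env.get("DB_PASSWORD"),
    }
```

With that in place, the connection would be opened as `conn = psycopg2.connect(**connection_params())` after `load_dotenv()` has run.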

I have to just use strings for most of these fields right now because they are so drastically denormalized and inconsistent. I'm not going to put in a primary key yet, because I want to delete duplicates first.

In [28]:
cur = conn.cursor()
try:
    cur.execute("""DROP TABLE IF EXISTS cars""")
    # id carries no PRIMARY KEY constraint yet; the insert step fills it with uuid_generate_v4()
    cur.execute("""CREATE TABLE cars(
        id UUID,
        name VARCHAR(255),
        make VARCHAR(255),
        model VARCHAR(255),
        year VARCHAR(255),
        color VARCHAR(128),
        body_style VARCHAR(128),
        mileage VARCHAR(255),
        num_owners VARCHAR(255),
        seating_capacity VARCHAR(255),
        fuel_type VARCHAR(64),
        fuel_capacity VARCHAR(255),
        engine_type VARCHAR(255),
        cc_displacement VARCHAR(255),
        transmission_gears VARCHAR(255),
        transmission_type VARCHAR(64),
        bhp FLOAT,
        torque FLOAT,
        fuel_economy VARCHAR(255),
        emission_class VARCHAR(64),
        price VARCHAR(255))
    """)
except Exception as e:
    print("An error occurred:", e)
    conn.rollback()  # rollback transaction
else:
    conn.commit()  # commit transaction
finally:
    cur.close()  # always release the cursor, even after an error


In [31]:
# Fetch all rows from cars table
try:
    cur = conn.cursor()
    cur.execute("""SELECT * FROM cars""")
    rows = cur.fetchall()
    for row in rows:
        print(row)
except Exception as e:
    print("An error occurred:", e)
    conn.rollback()  # rollback transaction
finally:
    cur.close()


('d7c29a76-2534-4672-97e5-2b8822f92b42', 'Volkswagen Ameo [2016-2017] Highline 1.5L AT (D)', 'Volkswagen', 'Ameo', '2017', 'silver', 'sedan', '44,611', '1st', '5', 'diesel', '45', '1.5L TDI Engine', '1498', '7-Speed', 'Automatic', 109.0, 250.0, '21.66', 'BS IV', '6,57,000')
('76358c57-1998-4e99-8adc-4264ccdd9548', 'Hyundai i20 Active [2015-2020] 1.2 SX', 'Hyundai', 'i20 Active', '2016', 'red', 'crossover', '20,305', '1st', '5', 'petrol', '45', '1.2L Kappa 5 Speed Manual Transmission', '1197', '5-Speed', 'Manual', 82.0, 115.0, '17.19', 'BS V', '6,82,000')
('557f089a-82bf-4ded-8a95-63b7d478e9ff', 'Honda WR-V VX i-VTEC', 'Honda', 'WR-V', '2019', 'white', 'suv', '29,540', '2nd', '5', 'petrol', '40', 'i-VTEC Petrol engine', '1199', '5-Speed', 'Manual', 88.5, 110.0, '16.5', 'BS IV', '7,93,000')
('1906d932-7307-4cb9-ac36-162f4b23aa6e', 'Renault Kwid 1.0 RXT AMT', 'Renault', 'Kwid', '2017', 'bronze', 'hatchback', '35,680', '1st', '5', 'petrol', '28', '1.0L', '999', '5-Speed', 'Manual', 67.0, 9

In [30]:
import csv

try: 
    cur = conn.cursor()
    cur.execute("""CREATE EXTENSION IF NOT EXISTS "uuid-ossp";""")
    # newline='' is what the csv module recommends so quoted fields with embedded newlines parse correctly
    with open('/Users/bfaris96/Desktop/turing-proj/cars_db/FINAL_SPINNY_900.csv', 'r', newline='') as f:
        reader = csv.reader(f)
        next(reader)  # Skip the header row.
        for row in reader:
            cur.execute(
                """INSERT INTO cars (id, name, make, model, year, color, body_style, mileage,
                       num_owners, seating_capacity, fuel_type, fuel_capacity, engine_type,
                       cc_displacement, transmission_gears, transmission_type, bhp, torque,
                       fuel_economy, emission_class, price)
                   VALUES (uuid_generate_v4(), %s, %s, %s, %s, %s, %s, %s, %s, %s, %s,
                           %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)""",
                row)
    conn.commit()
except Exception as e:
    print("An error occurred:", e)
    conn.rollback()  # rollback transaction
finally:
    cur.close()
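
Counting twenty `%s` placeholders by hand is easy to get wrong, so here is a small sketch that derives the statement from the column list instead. The names `COLUMNS`, `build_insert` and `INSERT_SQL` are hypothetical; the column names are the ones from the `cars` table above, and the table and column names are assumed to be trusted constants, never user input.

```python
COLUMNS = [
    "name", "make", "model", "year", "color", "body_style", "mileage",
    "num_owners", "seating_capacity", "fuel_type", "fuel_capacity",
    "engine_type", "cc_displacement", "transmission_gears",
    "transmission_type", "bhp", "torque", "fuel_economy",
    "emission_class", "price",
]

def build_insert(table, columns):
    """Return an INSERT with one %s placeholder per column plus a generated id."""
    column_list = ", ".join(["id"] + list(columns))
    placeholders = ", ".join(["uuid_generate_v4()"] + ["%s"] * len(columns))
    return f"INSERT INTO {table} ({column_list}) VALUES ({placeholders})"

INSERT_SQL = build_insert("cars", COLUMNS)
```

The loop body would then become `cur.execute(INSERT_SQL, row)`. Note that reading the file directly bypasses the whitespace and duplicate cleanup done on `df`; if the cleaned rows are the ones wanted, something like `cur.executemany(INSERT_SQL, list(df.itertuples(index=False, name=None)))` could supply them, though that is an assumption and not what the notebook currently does.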