## 0. Explanation

[Initially](https://github.com/fengyu20/sql-project-star-wars-analysis/tree/master/sqlite3_firstversion), I completed the project without addressing table connections.

Upon reviewing the data and examining [the project](https://github.com/san089/Udacity-Data-Engineering-Projects/tree/master/Data_Api_to_Postgres) from the Udacity Data Engineering Program—which involves building a simple ETL pipeline to fetch data and store it in a PostgreSQL database—I realized a better structure could enhance my code.

Therefore, by following the pipeline structure and addressing table connections, I developed a second version using sqlite3. This version can also be found on [GitHub](https://github.com/fengyu20/sql-project-star-wars-analysis/tree/master/sqlite3_secondversion).

I did not have enough time to adapt it to PostgreSQL, so I plan to implement that adaptation in the near future.

## 1. Request Data from Star Wars API
This part is adapted from `request.py` in the Udacity Data Engineering project.

In [24]:
pip install requests


In [2]:
import requests
import json
import sqlite3

In [3]:
class RequestStarWars:
    def __init__(self):
        self._base_urls = {
            "films": "https://swapi.dev/api/films/",
            "people": "https://swapi.dev/api/people/",
            "planets": "https://swapi.dev/api/planets/",
            "species": "https://swapi.dev/api/species/",
            "starships": "https://swapi.dev/api/starships/",
            "vehicles": "https://swapi.dev/api/vehicles/"
        }
        
    def get_table_names(self):
        return list(self._base_urls.keys())

    def get_content(self, type):
        url = self._base_urls[type]
        all_data = []

        while url:
            response = requests.get(url)
            if response.status_code == 200:
                response_data = response.json()
                all_data.extend(response_data['results'])
                
                url = response_data["next"]

            else:
                print(f"Request completed with Error. Response Code : {response.status_code}")
                break
        # get id from url    
        for item in all_data:
            item['id'] = int(item['url'].strip('/').split('/')[-1])
            
        return all_data
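
The pagination loop in `get_content` can be exercised without network access. Below is a minimal sketch, assuming a hypothetical `collect_pages` helper and a made-up two-page response (`FAKE_PAGES`) that stands in for `requests.get(url).json()`; neither name is part of the original project.

```python
from typing import Callable, Optional


def collect_pages(first_url: str, fetch: Callable[[str], dict]) -> list:
    """Follow 'next' links until the API reports no further page."""
    results = []
    url: Optional[str] = first_url
    while url:
        page = fetch(url)
        results.extend(page["results"])
        url = page.get("next")  # None on the last page ends the loop
    return results


# Hypothetical two-page response, keyed by a placeholder "URL"
FAKE_PAGES = {
    "page1": {"results": [{"name": "a"}, {"name": "b"}], "next": "page2"},
    "page2": {"results": [{"name": "c"}], "next": None},
}

items = collect_pages("page1", FAKE_PAGES.__getitem__)
print(len(items))
```

Passing the fetch function as an argument keeps the loop logic separate from the HTTP call, which makes it easy to test.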

In [4]:
# follow the don't-repeat-yourself (DRY) principle:
# define functions to read and write JSON data.
def write_to_file(file_path, data):
    with open(file_path, 'w') as file:
        json.dump(data, file, indent=4)

def read_from_file(file_path):
    with open(file_path, 'r') as file:
        data = json.load(file)
    return data

# extract ids from urls
def extract_ids(url_or_urls):
    if isinstance(url_or_urls, str): 
        return [url_or_urls.split('/')[-2]]
    elif isinstance(url_or_urls, list):  
        return [url.split('/')[-2] for url in url_or_urls]
    else:
        raise ValueError(f"Unexpected type: {type(url_or_urls)}")
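
The notebook extracts ids in two ways: `.strip('/').split('/')[-1]` in `get_content` and `.split('/')[-2]` in `extract_ids`. The second form relies on the URL ending with a slash. As a sketch only, a hypothetical helper `id_from_url` (not part of the original code) could handle both cases:

```python
def id_from_url(url: str) -> int:
    """Return the trailing numeric segment of a resource URL."""
    # rstrip removes an optional trailing slash before taking the last segment
    return int(url.rstrip('/').rsplit('/', 1)[-1])


print(id_from_url("https://swapi.dev/api/people/1/"))   # 1
print(id_from_url("https://swapi.dev/api/planets/12"))  # 12
```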

In [5]:
import os
os.makedirs('data', exist_ok=True)

request_api = RequestStarWars()
table_names = request_api.get_table_names()

for table_name in table_names:
    data = request_api.get_content(table_name)
    file_path = f'data/{table_name}.json'
    write_to_file(file_path, data)

In [6]:
# use a single table for the test
'''
request_api = RequestStarWars()

data = request_api.get_content('films')
with open(f'data/films.json', 'w') as file:
    json.dump(data, file, indent=4)
'''

In [7]:
# examine the content
single_file_path = "data/films.json"
result = read_from_file(single_file_path)
for item in result:
    print(item)

{'title': 'A New Hope', 'episode_id': 4, 'opening_crawl': "It is a period of civil war.\r\nRebel spaceships, striking\r\nfrom a hidden base, have won\r\ntheir first victory against\r\nthe evil Galactic Empire.\r\n\r\nDuring the battle, Rebel\r\nspies managed to steal secret\r\nplans to the Empire's\r\nultimate weapon, the DEATH\r\nSTAR, an armored space\r\nstation with enough power\r\nto destroy an entire planet.\r\n\r\nPursued by the Empire's\r\nsinister agents, Princess\r\nLeia races home aboard her\r\nstarship, custodian of the\r\nstolen plans that can save her\r\npeople and restore\r\nfreedom to the galaxy....", 'director': 'George Lucas', 'producer': 'Gary Kurtz, Rick McCallum', 'release_date': '1977-05-25', 'characters': ['https://swapi.dev/api/people/1/', 'https://swapi.dev/api/people/2/', 'https://swapi.dev/api/people/3/', 'https://swapi.dev/api/people/4/', 'https://swapi.dev/api/people/5/', 'https://swapi.dev/api/people/6/', 'https://swapi.dev/api/people/7/', 'https://swapi.de

In [8]:
connections = {
    "people_connections": {
        "homeworld": {},
        "films": {},
        "species": {},
        "vehicles": {},
        "starships": {}
    },
    "films_connections": {
        "characters": {},
        "species": {},
        "vehicles": {},
        "starships": {},
        "planets": {}
    },
    "starships_connections":{
        "films":{},
        "pilots":{}
    },
    "vehicles_connections":{
        "films":{},
        "pilots":{}
    },
    "species_connections":{
        "people":{},
        "films":{}        
    },
    "planets_connections":{
        "residents":{},
        "films":{}
    }
}


In [9]:
# create a function that can be used to save the connections among tables.
def create_connections(table_name, connections_dict):
    table_connections = {}

    file_path = f"data/{table_name}.json"
    table_data = read_from_file(file_path)
    for item in table_data:
        # for every item, create a new dict to store its connections;
        # use .copy() so that items do not share (and overwrite) one dict
        item_connections = connections_dict.copy()
        for key in item_connections.keys():
            # fall back to an empty list so that a missing or empty field
            # does not reuse the ids extracted for the previous key
            ids = extract_ids(item[key]) if item.get(key) else []
            item_connections[key] = ids
        table_connections[item['id']] = item_connections

    return table_connections
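
Why `create_connections` copies the template dictionary for each item can be seen in a small sketch. The data below is made up for illustration. Without `.copy()`, every entry refers to the same dictionary object, so the last write wins.

```python
items = [(1, ["1", "2"]), (2, ["3"])]  # made-up (id, film ids) pairs

# Without copy: every entry is the same object, so the last write wins.
template = {"films": []}
shared = {}
for item_id, films in items:
    entry = template
    entry["films"] = films
    shared[item_id] = entry

# With copy: each entry gets its own top-level dictionary.
template = {"films": []}
separate = {}
for item_id, films in items:
    entry = template.copy()
    entry["films"] = films
    separate[item_id] = entry

print(shared[1]["films"])    # ['3']
print(separate[1]["films"])  # ['1', '2']
```

`.copy()` is shallow, which is sufficient here because each key is assigned a new value rather than mutated in place.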

In [10]:
# save connections to json files
for table_name in table_names:
    connection_file_path = f'data/{table_name}_connections.json'
    connection_name = table_name + '_connections'
    data = create_connections(table_name, connections[connection_name])
    write_to_file(connection_file_path, data)

## 2. Create Schemas and Build Connections

In [11]:
db_name = 'starwars.db'

In [12]:
# define the data schema and write the relevant SQL statements according to the documentation
# the main change here is adding an id column to each table

create_tables_ddls = ['''
CREATE TABLE IF NOT EXISTS people (
    id INTEGER PRIMARY KEY,
    name TEXT,
    birth_year TEXT,
    eye_color TEXT,
    gender TEXT,
    hair_color TEXT,
    height TEXT,
    mass TEXT,
    skin_color TEXT
);
''',
'''
CREATE TABLE IF NOT EXISTS films (
    id INTEGER PRIMARY KEY,
    title TEXT,
    episode_id INTEGER,
    opening_crawl TEXT,
    director TEXT,
    producer TEXT,
    release_date DATE
);
''',
'''
CREATE TABLE IF NOT EXISTS starships (
    id INTEGER PRIMARY KEY,
    name TEXT,
    model TEXT,
    starship_class TEXT,
    manufacturer TEXT,
    cost_in_credits TEXT,
    length TEXT,
    crew TEXT,
    passengers TEXT,
    max_atmosphering_speed TEXT,
    hyperdrive_rating TEXT,
    MGLT TEXT,
    cargo_capacity TEXT,
    consumables TEXT
);
''',
'''
CREATE TABLE IF NOT EXISTS vehicles (
    id INTEGER PRIMARY KEY,
    name TEXT,
    model TEXT,
    vehicle_class TEXT,
    manufacturer TEXT,
    length TEXT,
    cost_in_credits TEXT,
    crew TEXT,
    passengers TEXT,
    max_atmosphering_speed TEXT,
    cargo_capacity TEXT,
    consumables TEXT
);
''',
'''
CREATE TABLE IF NOT EXISTS species (
    id INTEGER PRIMARY KEY,
    name TEXT,
    average_height TEXT,
    average_lifespan TEXT,
    classification TEXT,
    designation TEXT,
    eye_colors TEXT,
    hair_colors TEXT,
    homeworld TEXT,
    language TEXT,
    skin_colors TEXT
);
''',
'''
CREATE TABLE IF NOT EXISTS planets (
    id INTEGER PRIMARY KEY,
    name TEXT,
    diameter TEXT,
    rotation_period TEXT,
    orbital_period TEXT,
    gravity TEXT,
    population TEXT,
    climate TEXT,
    terrain TEXT,
    surface_water TEXT
);
'''
]

In [13]:
# pick some connections that will be used for data analysis, mainly those of the people table.
create_connections_ddls = [
    '''
    -- people_films
    CREATE TABLE IF NOT EXISTS people_films (
    person_id INTEGER,
    film_id INTEGER,
    PRIMARY KEY (person_id, film_id),
    FOREIGN KEY (person_id) REFERENCES people(id),
    FOREIGN KEY (film_id) REFERENCES films(id));
    ''',
    '''
    --people_species
    CREATE TABLE IF NOT EXISTS people_species (
    person_id INTEGER,
    specie_id INTEGER,
    PRIMARY KEY (person_id, specie_id),
    FOREIGN KEY (person_id) REFERENCES people(id),
    FOREIGN KEY (specie_id) REFERENCES species(id)
    );
    ''',
    ''' 
    --people_starships
    CREATE TABLE IF NOT EXISTS people_starships (
    person_id INTEGER,
    starship_id INTEGER,
    PRIMARY KEY (person_id, starship_id),
    FOREIGN KEY (person_id) REFERENCES people(id),
    FOREIGN KEY (starship_id) REFERENCES starships(id)
    );
    ''',
    '''
    -- people_vehicles
    CREATE TABLE IF NOT EXISTS people_vehicles (
    person_id INTEGER,
    vehicle_id INTEGER,
    PRIMARY KEY (person_id, vehicle_id),
    FOREIGN KEY (person_id) REFERENCES people(id),
    FOREIGN KEY (vehicle_id) REFERENCES vehicles(id)
    );
    ''',
    '''
    -- people_planets
    CREATE TABLE IF NOT EXISTS people_planets (
    person_id INTEGER,
    planet_id INTEGER,
    PRIMARY KEY (person_id, planet_id),
    FOREIGN KEY (person_id) REFERENCES people(id),
    FOREIGN KEY (planet_id) REFERENCES planets(id)
    );
    '''
]
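
To illustrate what these connection tables make possible, here is a minimal sketch that joins `people` to `people_films` in an in-memory database. The table definitions are trimmed to two columns and the rows are placeholders, not data from the API.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE films (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE people_films (
    person_id INTEGER,
    film_id INTEGER,
    PRIMARY KEY (person_id, film_id),
    FOREIGN KEY (person_id) REFERENCES people(id),
    FOREIGN KEY (film_id) REFERENCES films(id)
);
""")

# Placeholder rows for illustration only
conn.executemany("INSERT INTO people VALUES (?, ?)", [(1, "Person A"), (2, "Person B")])
conn.executemany("INSERT INTO films VALUES (?, ?)", [(1, "Film X"), (2, "Film Y")])
conn.executemany("INSERT INTO people_films VALUES (?, ?)", [(1, 1), (1, 2), (2, 1)])

# Count the films each person appears in via the connection table
rows = conn.execute("""
    SELECT p.name, COUNT(pf.film_id) AS film_count
    FROM people AS p
    JOIN people_films AS pf ON pf.person_id = p.id
    GROUP BY p.id
    ORDER BY film_count DESC, p.name
""").fetchall()
print(rows)
conn.close()
```

A query like this would not be possible if the list of URLs had been stored as a single string column.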

## 3. Insert Values into the Tables

In [14]:
class DatabaseDriver:
    def __init__(self, db_name):
        self._conn = sqlite3.connect(db_name)
        self._cur = self._conn.cursor()

    def create_tables(self, ddl_statements):
        for statement in ddl_statements:
            self._cur.execute(statement)
        self._conn.commit()

    def execute_query(self, query, params=None):
        if params:
            self._cur.execute(query, params)
        else:
            self._cur.execute(query)
        self._conn.commit()

    def save_data_to_table(self, table_name, item, primary_key_field):
        '''
        Insert individual data to the table.
        Item is a dict.
        '''    
        # exclude the relationship columns (stored in the connection tables) and API metadata fields
        columns_to_exclude = list(connections.get(f'{table_name}_connections', {}).keys())
        columns_to_exclude.extend(["url","created","edited"])
        filtered_item = {k: v for k, v in item.items() if k not in columns_to_exclude}
        #print(filtered_item)

        # build a generic INSERT statement and skip rows whose primary key already exists
        column_name = ', '.join(filtered_item.keys())
        place_holder = ', '.join(['?'] * len(filtered_item))
        primary_key_value = filtered_item.get(primary_key_field)
        #print(column_name)
        insert_sql = f"INSERT INTO {table_name} ({column_name}) VALUES ({place_holder})"
        check_sql = f"SELECT * FROM {table_name} WHERE {primary_key_field} = ?"
        
        self.execute_query(check_sql, (primary_key_value,))
        if self._cur.fetchone() is None:
            self.execute_query(insert_sql, tuple(filtered_item.values()))

    def save_connections(self, table_name, primary_key_field):
        data = read_from_file(f"data/{table_name}_connections.json")
        for primary_id, connections in data.items():
            id = int(primary_id)
            for connection_type, connection_ids in connections.items():
                connection_table_name = f'{table_name}_{connection_type}'
                for connection_id in connection_ids:
                    connection_id = int(connection_id)
                    # print(connection_type[:-1])
                    insert_sql = f'''
                        INSERT INTO {connection_table_name} ({primary_key_field}, {connection_type[:-1]}_id)
                        VALUES (?, ?)
                        ON CONFLICT ({primary_key_field}, {connection_type[:-1]}_id) DO NOTHING;
                    '''
                    self.execute_query(insert_sql,(id, connection_id))
                    
    def close(self):
        self._cur.close()
        self._conn.close()
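
`save_data_to_table` checks for an existing row with a SELECT before each INSERT. One possible alternative, shown here only as a sketch and not as the approach used in this notebook, is SQLite's `INSERT OR IGNORE` combined with `executemany` and a single commit. The function name `insert_rows` and the sample rows are hypothetical.

```python
import sqlite3


def insert_rows(conn, table_name, rows):
    """Insert dict rows, skipping any whose primary key already exists."""
    if not rows:
        return
    columns = list(rows[0].keys())
    column_list = ", ".join(columns)
    placeholders = ", ".join("?" for _ in columns)
    # Table and column names cannot be bound as parameters, so they must
    # come from trusted code; the values are passed as tuples.
    sql = f"INSERT OR IGNORE INTO {table_name} ({column_list}) VALUES ({placeholders})"
    conn.executemany(sql, [tuple(row[c] for c in columns) for row in rows])
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE planets (id INTEGER PRIMARY KEY, name TEXT)")
sample = [{"id": 1, "name": "Planet A"}, {"id": 2, "name": "Planet B"}]
insert_rows(conn, "planets", sample)
insert_rows(conn, "planets", sample)  # second run inserts nothing new
planet_count = conn.execute("SELECT COUNT(*) FROM planets").fetchone()[0]
print(planet_count)
conn.close()
```

The trade-off is that `INSERT OR IGNORE` silently skips rows that violate any uniqueness constraint, not only the primary key.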

In [15]:
# create tables
connect_star_wars = DatabaseDriver(db_name)
connect_star_wars.create_tables(create_tables_ddls)
connect_star_wars.create_tables(create_connections_ddls)

# insert data to tables from json files
for table_name in table_names:
    table_file_path = f'data/{table_name}.json'
    data = read_from_file(table_file_path)
    for item in data:
        connect_star_wars.save_data_to_table(table_name, item, 'id')

In [16]:
# clean the people data
people_data = read_from_file("data/people_connections.json")
for person_id, person_data in people_data.items():
    if person_data.get('homeworld'):
        person_data['planets'] = person_data.pop('homeworld')

write_to_file("data/people_connections.json", people_data)

In [17]:
# insert connections into the table
connect_star_wars.save_connections('people','person_id')

connect_star_wars.close()

## 4. Data Validation
- This part is implemented in the first version.
- Check if we have collected the right amount of data from the API.

In [18]:
# check whether the row count of each table matches the count reported by the API
def check_table_rows(table_name):
    with sqlite3.connect('starwars.db') as conn:
        c = conn.cursor()
        rows_count = c.execute(f"SELECT count(*) FROM {table_name}").fetchone()[0]
    return rows_count

def check_api_count(table_name):
    response = requests.get(f'https://swapi.dev/api/{table_name}')

    api_count = response.json()["count"]
    rows_count = check_table_rows(table_name)

    if api_count > rows_count:
        print(f"Check Table {table_name}, there are missing rows. Table has {rows_count} rows, it is supposed to have {api_count} rows.")
    elif api_count < rows_count:
        print(f"Check Table {table_name}, there are more rows than expected. Table has {rows_count} rows, it is supposed to have {api_count} rows.")
    else:
        print(f"Table {table_name} passes the data validation.")


In [19]:
# call each table for the validation
table_names = ['people', 'films', 'starships', 'vehicles', 'species', 'planets']

for table_name in table_names:
    check_api_count(table_name)

Table people passes the data validation.
Table films passes the data validation.
Table starships passes the data validation.
Table vehicles passes the data validation.
Table species passes the data validation.
Table planets passes the data validation.


## 5. Key Takeaways:

1. Before implementing code, it's essential to understand the data and define the question that needs to be solved. For instance, if we need to join tables, it's not appropriate to convert the list of URLs to a string. Instead, we should extract IDs from the URLs and create connection tables.
2. Review similar online projects to understand the best practices concerning related issues. This can save a significant amount of time and effort in the future.
3. Code-related Takeaways:
   1. To create a new dictionary for each item inside a function, call `.copy()` on the template dictionary instead of reusing the same object. Note that `.copy()` is shallow, so nested values are still shared.
   2. For the INSERT statement, the values should be passed as query parameters, in the form of a tuple, rather than being formatted into the SQL string.
   3. Before manipulating data directly in the database (which is hard to do in SQLite), we can process the data as JSON. For example, `.pop` can be used to rename a key.
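
The last two code-related takeaways can be illustrated with a short sketch. The table, row, and record below are placeholders chosen for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE films (id INTEGER PRIMARY KEY, title TEXT)")

# Values travel separately from the SQL text, as a tuple.
conn.execute("INSERT INTO films (id, title) VALUES (?, ?)", (1, "Film X"))

# A single parameter still needs a one-element tuple: note the trailing comma.
row = conn.execute("SELECT title FROM films WHERE id = ?", (1,)).fetchone()
print(row)

# Renaming a key in loaded JSON data before it reaches the database.
record = {"homeworld": ["1"], "films": ["1", "2"]}
record["planets"] = record.pop("homeworld")
print(sorted(record))
conn.close()
```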