# Building model for schema

![database schema](../Asset/ER-1.png)

## Manipulation of the data

The live COVID-19 vaccination data is available on [GitHub](https://github.com/owid/covid-19-data/tree/master/public/data/vaccinations).

The following tables are considered for this visualisation:

- locations.csv
- us_state_vaccinations.csv
- vaccinations-by-age-group.csv
- vaccinations-by-manufacturer.csv
- vaccinations.csv
- country_data/Wales.csv
- country_data/Canada.csv
- country_data/United States.csv
- country_data/Denmark.csv

### Check if the dataset exists, or download it

Look for the `covid-19-data` directory in the root folder. If it is missing, fetch the data from the respective repository.
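The check described here can be sketched roughly as follows. This is only an illustration: `ensure_dataset` and the `fetch` callback are hypothetical names, not the actual `makeDataset` implementation, and the demo uses a stand-in fetcher instead of a real download.

```python
import os
import tempfile


def ensure_dataset(root, fetch, dirname="covid-19-data"):
    """Return the dataset directory, calling fetch(path) only when it is missing.

    `fetch` is a hypothetical callback standing in for the real download step.
    """
    path = os.path.join(root, dirname)
    if not os.path.isdir(path):
        fetch(path)
    return path


# Demo with a stand-in fetcher that only creates the directory.
calls = []


def fake_fetch(path):
    calls.append(path)
    os.makedirs(path)


with tempfile.TemporaryDirectory() as root:
    first = ensure_dataset(root, fake_fetch)   # directory missing: fetch runs
    second = ensure_dataset(root, fake_fetch)  # directory present: fetch skipped
    fetched_once = len(calls) == 1 and first == second
```

Passing the fetcher in as an argument keeps the existence check separate from the network code, which makes it easy to try out without touching GitHub.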

In [1]:
import sys
sys.path.append('..')

In [2]:
from scripts.makeDataset import makeDataset

(raw_vaccination, raw_manufacture, raw_vaccine_age, raw_us_state, raw_location, raw_population, raw_iso, raw_canada) = makeDataset()

Fetching the data from github public/data/vaccinations/vaccinations.csv
Successfully fetched the file public/data/vaccinations/vaccinations.csv
Fetching the data from github public/data/vaccinations/vaccinations-by-manufacturer.csv
Successfully fetched the file public/data/vaccinations/vaccinations-by-manufacturer.csv
Fetching the data from github public/data/vaccinations/vaccinations-by-age-group.csv
Successfully fetched the file public/data/vaccinations/vaccinations-by-age-group.csv
Fetching the data from github public/data/vaccinations/us_state_vaccinations.csv
Successfully fetched the file public/data/vaccinations/us_state_vaccinations.csv
Fetching the data from github public/data/vaccinations/locations.csv
Successfully fetched the file public/data/vaccinations/locations.csv
Fetching the data from github scripts/input/un/population_latest.csv
Successfully fetched the file scripts/input/un/population_latest.csv
Fetching the data from github scripts/input/iso/iso.csv
Successfully fet

Let's start cleaning the data and manipulating it as required.

In [3]:
import os
from scripts.csvRW import read_csv, create_csv, read_csv_without_header, append_row_to_csv

directory = '../model'
if not os.path.exists(directory):
    os.makedirs(directory)
    print(f"Directory {directory} created.")
else:
    print(f"Directory {directory} already exists.")

Directory ../model already exists.


#### Source Schema
```
Source(_source_id_, source_name, source_website)
```

In [4]:
result = list(set([(data['source_name'],data['source_website']) for data in raw_location]))
result.append(['',''])
create_csv("../model/Source.csv",result)

File is successfully created at ../model/Source.csv
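`Source.csv` has no explicit `source_id` column; later cells derive the id from the row position (index + 1). Because a `set` has no guaranteed order, those ids can differ between runs. A minimal sketch of one way to keep them stable, assuming that sorting the pairs is acceptable (the sample rows are made up for illustration):

```python
# Made-up sample rows standing in for raw_location.
sample_location = [
    {"source_name": "Ministry of Health", "source_website": "https://example.org/moh"},
    {"source_name": "Agency B", "source_website": "https://example.org/b"},
    {"source_name": "Ministry of Health", "source_website": "https://example.org/moh"},
]

# Deduplicate, then sort so the row order (and thus the implicit id) is reproducible.
pairs = sorted({(row["source_name"], row["source_website"]) for row in sample_location})
pairs.append(("", ""))  # placeholder source for countries without one

# source_id is the 1-based position of the row.
source_ids = {pair: idx for idx, pair in enumerate(pairs, start=1)}
```

With this layout the placeholder row always receives the last id, so it can be looked up by key rather than by a fixed number.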


#### Country Schema 

Read all country ISO codes.
```
Country(_iso_code_*,location, source_id*, last_observation_date)
```

In [5]:

resultLocation ={}
for data in raw_iso:
    resultLocation[data['iso_code']] = data["location"]

source = read_csv_without_header("../model/Source.csv")

result = []
for data in raw_location:
    if [data['source_name'],data['source_website']] in source:
        result.append([data['iso_code'],data['location'],source.index([data['source_name'],data['source_website']])+1, data['last_observation_date']])
        del resultLocation[data['iso_code']]
    else:
        print("Missed country",data['iso_code'])

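# Countries that appear only in the ISO list get the blank placeholder source,
# i.e. the ['', ''] row appended to Source.csv (previously hard-coded as 108).
blank_source_id = source.index(['', '']) + 1
for iso_code, location in resultLocation.items():
    result.append([iso_code,location,blank_source_id,''])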

create_csv("../model/Country.csv",result)

File is successfully created at ../model/Country.csv
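The Country step merges two inputs: locations that come with a source, and the remaining ISO entries that do not. The sketch below shows the same idea with dictionary lookups instead of `list.index`. The function name and the sample values are hypothetical, and unlike the cell above it falls back to the placeholder source rather than printing a "missed" message.

```python
def build_country_rows(locations, iso_names, source_ids):
    """Combine sourced locations with the remaining ISO entries.

    locations  : list of dicts with iso_code, location, source_name,
                 source_website, last_observation_date
    iso_names  : dict iso_code -> location name
    source_ids : dict (source_name, source_website) -> source_id
    """
    remaining = dict(iso_names)        # copy so the caller's dict is untouched
    blank_id = source_ids[("", "")]    # id of the placeholder source
    rows = []
    for loc in locations:
        key = (loc["source_name"], loc["source_website"])
        rows.append([loc["iso_code"], loc["location"],
                     source_ids.get(key, blank_id), loc["last_observation_date"]])
        remaining.pop(loc["iso_code"], None)  # tolerate codes absent from the ISO list
    # Countries known only from the ISO list get the placeholder source.
    rows.extend([code, name, blank_id, ""] for code, name in remaining.items())
    return rows


rows = build_country_rows(
    [{"iso_code": "DNK", "location": "Denmark", "source_name": "SSI",
      "source_website": "https://example.org/ssi",
      "last_observation_date": "2023-01-01"}],
    {"DNK": "Denmark", "NPL": "Nepal"},
    {("SSI", "https://example.org/ssi"): 1, ("", ""): 2},
)
```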


#### Age_Group Schema
```
Age_group(_age_group_)
```

In [6]:
result = set([data['age_group'] for data in raw_vaccine_age])
result = [[data] for data in result]
create_csv("../model/Age_group.csv",result)

File is successfully created at ../model/Age_group.csv


#### Vaccine Schema
```
Vaccine(_vaccine_)
```

In [7]:
result = []
for data in raw_location:
    result.append(data['vaccines'].split(","))

result = [x.strip() for xs in result for x in xs]

for data in raw_manufacture:
    result.append(data['vaccine'].strip())

result = [[val] for val in set(result)]

create_csv("../model/Vaccine.csv",result)

File is successfully created at ../model/Vaccine.csv
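The `vaccines` column in `locations.csv` holds a comma-separated list, so each value has to be split and trimmed before deduplication. A small sketch of that normalisation, using made-up rows; `distinct_vaccines` is a hypothetical helper, and unlike the cell above it also drops empty names:

```python
def distinct_vaccines(location_rows, manufacturer_rows):
    """Collect the distinct, whitespace-trimmed vaccine names from both inputs."""
    names = set()
    for row in location_rows:
        # "vaccines" holds a comma-separated list such as "Moderna, Pfizer/BioNTech".
        names.update(v.strip() for v in row["vaccines"].split(",") if v.strip())
    names.update(row["vaccine"].strip() for row in manufacturer_rows)
    return sorted(names)


vaccines = distinct_vaccines(
    [{"vaccines": "Moderna, Pfizer/BioNTech"}, {"vaccines": "Pfizer/BioNTech"}],
    [{"vaccine": "Novavax "}],
)
```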


#### Population_Country schema
```
Population_Country(_iso_code_*, year, population)
```

In [8]:
raw_data_loc = read_csv_without_header("../model/Country.csv")
raw_data_loc = [data[0] for data in raw_data_loc]

result = [[data['iso_code'],data['year'],data['population']] for data in raw_population if data['iso_code'] in raw_data_loc]

create_csv("../model/Population_Country.csv",result)

File is successfully created at ../model/Population_Country.csv


#### Country_Vaccine Schema
```
Country_Vaccine(_iso_code_*,_vaccine_*)
```


In [9]:
raw_data_iso = read_csv_without_header("../model/Country.csv")
raw_data_iso = [data[0] for data in raw_data_iso]

raw_data_vac = read_csv_without_header("../model/Vaccine.csv")
raw_data_vac = [data[0] for data in raw_data_vac]

result = []
for data in raw_location:
    vaccineList = [val.strip() for val in data['vaccines'].split(",")]

    if data['iso_code'] in raw_data_iso:
        for vaccine in vaccineList:
            if vaccine in raw_data_vac:
                result.append((data['iso_code'],vaccine))
            else:
                print("Vaccine not found in vaccine table:",vaccine)
    else:
        print("Country not found in country table:",data['iso_code'])

create_csv("../model/Country_Vaccine.csv",set(result))


File is successfully created at ../model/Country_Vaccine.csv
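`Country_Vaccine` is a junction table: one row per (country, vaccine) pair, each checked against its parent table. The sketch below illustrates that pattern with made-up data. `country_vaccine_pairs` is a hypothetical helper that returns the rejected entries instead of printing them.

```python
def country_vaccine_pairs(location_rows, known_iso, known_vaccines):
    """Return sorted (iso_code, vaccine) pairs plus the entries that failed a lookup."""
    pairs, rejected = set(), []
    for row in location_rows:
        if row["iso_code"] not in known_iso:
            rejected.append(("country", row["iso_code"]))
            continue
        for name in (v.strip() for v in row["vaccines"].split(",")):
            if name in known_vaccines:
                pairs.add((row["iso_code"], name))
            else:
                rejected.append(("vaccine", name))
    return sorted(pairs), rejected


pairs, rejected = country_vaccine_pairs(
    [{"iso_code": "DNK", "vaccines": "Moderna, Pfizer/BioNTech"},
     {"iso_code": "XXX", "vaccines": "Moderna"}],
    known_iso={"DNK"},
    known_vaccines={"Moderna", "Pfizer/BioNTech"},
)
```

Collecting rejections in a list makes it possible to inspect or count them afterwards, which is harder with printed messages.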


#### Manufacture_date Schema

```Manufacture_date(_iso_code_*, _date_, _vaccine_*, total_vaccinations)```

In [10]:
raw_data_iso = read_csv_without_header("../model/Country.csv")
iso_map = {}

for data in raw_data_iso:
    iso_map[data[1]] = data[0]
    
result = []

for data in raw_manufacture:
    if data['location'] in iso_map:
        vaccine = data['vaccine'].strip()  # Vaccine.csv stores stripped names
        if vaccine in raw_data_vac:
            result.append([iso_map[data['location']],data['date'],vaccine, data['total_vaccinations']])
        else:
            # Skip only this row; a `break` here would drop every remaining row
            print(data['vaccine']," is not in vaccine table")
    else:
        print(data['location']," is not in country table")

create_csv("../model/Manufacture_date.csv",result)

File is successfully created at ../model/Manufacture_date.csv


#### Vaccination Schema

```Vaccination(_iso_code_*, _date_, total_vaccinations, people_vaccinated, people_fully_vaccinated, total_booster, daily_vaccination_raw, daily_vaccination, daily_people_vaccinated)```

In [11]:
iso_list = set(iso_map.values())

result = []
for data in raw_vaccination:
    if data['iso_code'] not in iso_list:
        print("Missing iso code in country table",data['iso_code'])
        append_row_to_csv("../model/Country.csv",[data['iso_code'],data['location'],'',''])
        iso_list.add(data['iso_code'])  # append each missing country only once
    
    result.append([data['iso_code'],data['date'],data['total_vaccinations'],data['people_vaccinated'],data['people_fully_vaccinated'],data['total_boosters'],data['daily_vaccinations_raw'],data['daily_vaccinations'], data['daily_people_vaccinated']])

# Country.csv may have gained rows above; reload it before reusing the country table

create_csv("../model/Vaccination.csv",result)

File is successfully created at ../model/Vaccination.csv


#### State_Vaccination Schema

```State_Vaccinations(_iso_code_*, _date_*, _states_, total_vaccinations, total_distributed, people_vaccinated, people_fully_vaccinated, daily_vaccination_raw, daily_vaccinations, share_doses_used, total_boosters, population)```

In [12]:
model_vac = read_csv_without_header("../model/Vaccination.csv")

iso_code ='USA'
fk = {data[1] for data in model_vac if data[0] == iso_code}

result = []
for data in raw_us_state:
    if data['date'] in fk:
        population = 0
        if str(data['total_vaccinations_per_hundred']).replace('.','',1).isdigit() and float(data['total_vaccinations_per_hundred']) > 0:
            population = (float(data['total_vaccinations'])/float(data['total_vaccinations_per_hundred']))*100
            
        result.append([iso_code,data['date'],data['location'],data['total_vaccinations'],data['total_distributed'],data['people_vaccinated'],data['people_fully_vaccinated'],data['daily_vaccinations_raw'],data['daily_vaccinations'],data['share_doses_used'],data['total_boosters'],population])
    else:
        print(f"{data['date']} is not in vaccination table for {iso_code}")

create_csv("../model/State_Vaccination.csv",result)

2023-05-10 is not in vaccination table for USA

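The `population` column of `State_Vaccination` is not present in the source file. The cell above derives it from the relation `per_hundred = total / population * 100`, which rearranges to `population = total / per_hundred * 100`. A small worked sketch of that arithmetic; `estimate_population` is a hypothetical helper and the numbers are made up:

```python
import math


def estimate_population(total_vaccinations, per_hundred):
    """Back out the population from a total and its per-hundred rate.

    per_hundred = total / population * 100  =>  population = total / per_hundred * 100
    Returns 0 when an input is missing, non-numeric, or the rate is not positive.
    """
    try:
        total = float(total_vaccinations)
        rate = float(per_hundred)
    except (TypeError, ValueError):
        return 0
    if not (math.isfinite(total) and math.isfinite(rate)) or rate <= 0:
        return 0
    return total / rate * 100


estimate = estimate_population("1500000", "75.0")  # 1.5M doses at 75 per hundred
missing = estimate_population("1500000", "")       # empty CSV field
```

Here 1,500,000 doses at 75 doses per hundred people imply a population of 2,000,000. The result is only an estimate, since the published rate is rounded.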
#### Vaccination_Age Schema 
```Vaccination_age(iso_code*, date*, age_group*, people_vaccinated_per_hundred, people_fully_vaccinated_per_hundred, people_with_booster_per_hundred)```

In [13]:
fk = {f"{data[0]}{data[1]}" for data in model_vac}

result = []
for data in raw_vaccine_age:
    if f"{iso_map[data['location']]}{data['date']}" in fk:
        result.append([iso_map[data['location']],data['date'],data['age_group'],data['people_vaccinated_per_hundred'],data["people_fully_vaccinated_per_hundred"],data["people_with_booster_per_hundred"]])
    else:
        print("Not in parent vaccination table",data['location'],data['date'])

create_csv("../model/Vaccination_age.csv",result)

Not in parent vaccination table Argentina 2020-01-01
Not in parent vaccination table Argentina 2020-01-02

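The check above is a foreign-key test: an age-group row is kept only if its `(iso_code, date)` pair exists in the parent `Vaccination` table. A minimal sketch of the same test using a set of tuples, which avoids concatenating the key parts into a string and gives constant-time lookups. `split_by_parent` and the sample rows are hypothetical:

```python
def split_by_parent(child_rows, parent_keys):
    """Partition child rows by whether their (iso_code, date) exists in the parent."""
    kept, orphaned = [], []
    for row in child_rows:
        target = kept if (row["iso_code"], row["date"]) in parent_keys else orphaned
        target.append(row)
    return kept, orphaned


# Made-up rows for illustration.
parent_keys = {("ARG", "2021-01-01"), ("ARG", "2021-01-02")}
children = [
    {"iso_code": "ARG", "date": "2021-01-01", "age_group": "18-29"},
    {"iso_code": "ARG", "date": "2020-01-01", "age_group": "18-29"},
]
kept, orphaned = split_by_parent(children, parent_keys)
```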
After data manipulation, the CSV for each schema is built in the `../model` directory and can be imported into its respective table in SQLite.

The creation of the database and the insertion of data into the tables are carried out in the [database pipeline](./databasePipeline.ipynb) notebook.
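As a rough illustration of what such an import can look like, the sketch below loads CSV rows into an in-memory SQLite table using only the standard library. The table definition, the sample values, and the assumption that the CSV has no header row are all illustrative; the actual pipeline notebook may do this differently.

```python
import csv
import io
import sqlite3

# Stand-in for the contents of ../model/Age_group.csv (made-up values, no header row).
csv_text = "0-17\n18-29\n30-39\n"

conn = sqlite3.connect(":memory:")  # in-memory database for the demo
# Assumed table definition; the real DDL may differ.
conn.execute("CREATE TABLE Age_group (age_group TEXT PRIMARY KEY)")

# Each CSV line becomes a one-element row, matching the single column.
rows = list(csv.reader(io.StringIO(csv_text)))
conn.executemany("INSERT INTO Age_group (age_group) VALUES (?)", rows)
conn.commit()

row_count = conn.execute("SELECT COUNT(*) FROM Age_group").fetchone()[0]
conn.close()
```

Parent tables such as `Country` and `Vaccine` would need to be loaded before the tables that reference them if foreign-key enforcement is enabled.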