# Project Overview

This project analyzes total import values for the top 25 industries in Canada. The objective is to identify trends and patterns in the 2019 to 2023 data and use them to predict the behavior of these industries for the current year, 2024.

The data was collected from [Trade Data Online](https://www.ic.gc.ca/app/scr/tdst/tdo/crtr.html?reportType=TI&grouped=GROUPED&searchType=All&timePeriod=5%7cComplete+Years&currency=CDN&naArea=9999&countryList=ALL&productType=NAICS&toFromCountry=CDN&changeCriteria=true), an official Government of Canada website.

## Data Collection Criteria

- **Trade Type:** Total imports
- **Trader:** Canada
- **Trading Partner:** All countries (Total)
- **Time Period (Specific Years):** 2019, 2020, 2021, 2022, 2023
- **Value:** $ Canadian (current dollars)
- **Industry:** Top 25 industries (5-digit NAICS codes)

## Libraries Used for Analysis

- **Pandas:** Used for data preparation and cleaning. It allows for easy loading of data from CSV files, handling missing values, and merging multiple datasets into a single DataFrame for comprehensive analysis.
- **NumPy:** Used for efficient data manipulation and numerical computations. It provides support for large, multi-dimensional arrays and matrices, which are essential for handling the dataset and performing various mathematical operations.
- **scikit-learn:** Used for applying machine learning algorithms to the data. Specifically, it will be used for linear regression to identify trends and make predictions about the behavior of the top 25 industries in Canada.
- **SQLAlchemy:** Used for connecting to the SQLite database, creating tables, and inserting data from CSV files into the database.
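As a rough illustration of the scikit-learn step, the sketch below fits a straight line to five yearly values for a single industry and extrapolates to 2024. The values are made up for illustration and are not taken from the dataset, and `predict_next_year` is a hypothetical helper name, not a function from this project.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def predict_next_year(years, values, target_year):
    """Fit value = a * year + b and return the fitted value at target_year."""
    X = np.asarray(years, dtype=float).reshape(-1, 1)  # single feature: the year
    y = np.asarray(values, dtype=float)
    model = LinearRegression().fit(X, y)
    return float(model.predict(np.array([[float(target_year)]]))[0])


# Illustrative, perfectly linear values (NOT real import figures).
years = [2019, 2020, 2021, 2022, 2023]
values = [10.0, 12.0, 14.0, 16.0, 18.0]

forecast = predict_next_year(years, values, 2024)
print(f"Forecast for 2024: {forecast:.1f}")
```

With only five observations per industry, a fit like this is best read as a coarse trend estimate rather than a precise forecast.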


## Project Structure
- **clean_data.py:** Python script for cleaning the data and making predictions.
- **data/:** Directory containing the raw and cleaned data files.
- **notebooks/:** Directory containing Jupyter notebooks for data analysis and documentation.
- **scripts/:** Directory containing scripts for database setup and data loading.
- **README.md:** Project documentation.
- **requirements.txt:** List of dependencies required for the project.


## How to Run the Code
Clone the repository [gifcolls/canadian-imports-data-analysis](https://github.com/gifcolls/canadian-imports-data-analysis) and change into its directory:
`cd canadian-imports-data-analysis`

## Install Dependencies
The project requires pandas, NumPy, scikit-learn, and SQLAlchemy. If they are not already installed, install them using pip:
`pip install -r requirements.txt`

## Run the Script
`python ../scripts/clean_data.py`

## Output
The cleaned data is saved as `cleaned_data.csv` in the `data/` directory. The script also prints the predicted import value for the current year.





In [2]:
# Import libraries
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sqlalchemy import create_engine
import sqlite3


## Database Setup

We use a Python script to load data from CSV files into a SQLite database. This script reads the CSV files, processes them, and inserts the data into the database.
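The loading step described above might look roughly like the following sketch. It is not the project's `setup_database.py`. The helper name `load_csvs_to_sqlite` and the file names `imports_<year>.csv` are hypothetical, each CSV is assumed to have `Category` and `Value` columns, and the standard-library `sqlite3` module is used instead of SQLAlchemy to keep the example small. The two sample rows are copied from the output shown in this notebook.

```python
import sqlite3
import tempfile
from pathlib import Path

import pandas as pd


def load_csvs_to_sqlite(csv_by_year, conn, table="imports"):
    """Read one CSV per year, tag each row with its year, and write one table."""
    frames = []
    for year, path in sorted(csv_by_year.items()):
        frame = pd.read_csv(path)  # assumed columns: Category, Value
        frame["Year"] = year
        frames.append(frame)
    combined = pd.concat(frames, ignore_index=True)
    combined.to_sql(table, conn, if_exists="replace", index=False)
    return len(combined)


# Demo: write two tiny CSV files to a temporary directory, then load them.
samples = {
    2019: ("32411 - Petroleum refineries", 18598.200),
    2020: ("21222 - Gold and silver ore mining", 14663.445),
}
with tempfile.TemporaryDirectory() as tmp:
    paths = {}
    for year, (category, value) in samples.items():
        path = Path(tmp) / f"imports_{year}.csv"  # hypothetical file name
        pd.DataFrame({"Category": [category], "Value": [value]}).to_csv(path, index=False)
        paths[year] = path

    conn = sqlite3.connect(":memory:")  # in-memory database for the demo
    rows_written = load_csvs_to_sqlite(paths, conn)
    loaded = pd.read_sql_query("SELECT * FROM imports ORDER BY Year", conn)
    conn.close()

print(loaded)
```

Using `if_exists="replace"` makes the load repeatable: rerunning it rebuilds the table instead of appending duplicate rows.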

## Running Setup Script
We will run the setup script to initialize the database and load data.

In [3]:
# Run the setup script
%run ../scripts/setup_database.py


First few rows of the data for 2019:
                                            Category      Value  Year
1  33611 - Automobile and light-duty motor vehicl...  52045.933  2019
2  21111 - Oil and gas extraction (except oil sands)  22305.242  2019
3  32541 - Pharmaceutical and medicine manufacturing  21545.832  2019
4  33641 - Aerospace product and parts manufacturing  19300.922  2019
5                       32411 - Petroleum refineries  18598.200  2019
First few rows of the data for 2020:
                                            Category      Value  Year
1  33611 - Automobile and light-duty motor vehicl...  41140.574  2020
2  32541 - Pharmaceutical and medicine manufacturing  22573.708  2020
3  33411 - Computer and peripheral equipment manu...  16842.468  2020
4                 21222 - Gold and silver ore mining  14663.445  2020
5  33641 - Aerospace product and parts manufacturing  13598.319  2020
First few rows of the data for 2021:
... (output truncated)

## Verifying the Data

In [4]:
# Connect to the SQLite database (absolute path; adjust it for your machine)
db_path = 'C:/Users/berli/canadian-imports-data-analysis/canadian_imports.db'
conn = sqlite3.connect(db_path)

# Query check
query = "SELECT * FROM imports LIMIT 10"
df_check = pd.read_sql_query(query, conn)
print("First 10 rows from the full data query")
print(df_check)

First 10 rows from the full data query
                                            Category      Value  Year
0  33611 - Automobile and light-duty motor vehicl...  52045.933  2019
1  21111 - Oil and gas extraction (except oil sands)  22305.242  2019
2  32541 - Pharmaceutical and medicine manufacturing  21545.832  2019
3  33641 - Aerospace product and parts manufacturing  19300.922  2019
4                       32411 - Petroleum refineries  18598.200  2019
5  33411 - Computer and peripheral equipment manu...  16191.013  2019
6    33639 - Other motor vehicle parts manufacturing  15572.479  2019
7  33451 - Navigational, measuring, medical and c...  10322.429  2019
8             33612 - Heavy-duty truck manufacturing   9656.805  2019
9  33422 - Radio and television broadcasting and ...   9644.989  2019


## Exploratory Data Analysis (EDA)

### Data Loading
We loaded the import data from a SQLite database into a Pandas DataFrame.

### Data Overview
The dataset contains the following columns: `Category`, `Value`, `Year`. Below are the first few rows of the data:

In [None]:
# Load the full imports table and display the first few rows
df = pd.read_sql_query("SELECT * FROM imports", conn)
print(df.head())
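One possible first exploratory step is to compare values across years for each industry. The sketch below is only an illustration: it uses three industries for 2019 and 2020, with values copied from the outputs shown earlier in this notebook, and uses the NAICS codes alone as stand-ins for the full category labels. The variable names are assumptions.

```python
import pandas as pd

# Three industries for 2019 and 2020, copied from the outputs shown earlier.
# NAICS codes stand in for the full category labels.
sample = pd.DataFrame(
    {
        "Category": ["33611", "32541", "33641", "33611", "32541", "33641"],
        "Value": [52045.933, 21545.832, 19300.922, 41140.574, 22573.708, 13598.319],
        "Year": [2019, 2019, 2019, 2020, 2020, 2020],
    }
)

# Reshape to one row per industry and one column per year.
by_year = sample.pivot(index="Category", columns="Year", values="Value")

# Percentage change from 2019 to 2020 for each industry.
pct_change = (by_year[2020] - by_year[2019]) / by_year[2019] * 100
print(pct_change.round(2).sort_values())
```

In this sample, 33611 (automobiles) and 33641 (aerospace) declined from 2019 to 2020, while 32541 (pharmaceuticals) increased.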

## Conclusion

We have successfully set up a SQLite database and loaded Canadian import data from 2019 to 2023. The data is now ready for further analysis and visualization.

## Next Steps

- Perform data analysis to uncover trends and insights.
- Create visualizations to better understand the data.
- Extend the database with additional data sources if necessary.
