## Report: Importing and Cleaning NASA Meteorite Data

This project focuses on importing and cleaning a CSV file containing historical meteorite landing data into a SQLite database named `meteorites.db`. This task is part of the problem set available at [https://cs50.harvard.edu/sql/2024/psets/3/meteorites/](https://cs50.harvard.edu/sql/2024/psets/3/meteorites/). The goal is to transform the raw data into a structured and clean format suitable for analysis by NASA engineers. The process involves creating a new table named `meteorites` within the database and populating it with a subset of the columns from the CSV file, while also addressing data quality issues such as empty values, decimal precision, and irrelevant entries.

## Database Schema: `meteorites.db` - Table: `meteorites`

The `meteorites` table within the `meteorites.db` database will be structured to store cleaned information about meteorite landings. The table will contain the following columns, derived and transformed from the source CSV file:

| Column Name | Data Type | Description                                                                |
|-------------|-----------|----------------------------------------------------------------------------|
| `id`        | INTEGER   | A unique identifier for each meteorite, assigned based on the sorted data. |
| `name`      | TEXT      | The given name of the meteorite.                                           |
| `class`     | TEXT      | The classification of the meteorite, based on the [traditional classification scheme](https://en.wikipedia.org/wiki/Meteorite_classification). |
| `mass`      | REAL      | The weight of the meteorite in grams.                                      |
| `discovery` | TEXT      | Indicates whether the meteorite "Fell" or was "Found".                     |
| `year`      | INTEGER   | The year in which the meteorite was discovered.                            |
| `lat`       | REAL      | The latitude at which the meteorite landed.                                |
| `long`      | REAL      | The longitude at which the meteorite landed.                               |


## Data Import and Cleaning Process

This notebook runs a series of SQLite commands to perform the following steps:

1.  **Create the `meteorites` table:** Define the schema of the new table with the specified columns and data types.
2.  **Import data from `meteorites.csv`:** Read the data from the CSV file.
3.  **Clean and transform the data:**
    * Handle empty values by converting them to `NULL`.
    * Round the `mass`, `lat`, and `long` columns to the nearest hundredth (two decimal places).
    * Filter out meteorites where the `nametype` is "Relict".
4.  **Sort the data:** Order the remaining meteorites first by `year` (ascending) and then by `name` (ascending).
5.  **Assign new IDs:** Create a new `id` column and assign sequential integer values starting from 1 based on the sorted order.
6.  **Insert the cleaned and transformed data** into the `meteorites` table.
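The six steps above can be sketched end to end on a tiny in-memory database. This is a minimal illustration, not the notebook's actual code: the `staging` table, the `build_demo` helper, and the four sample rows are all hypothetical, and `NULLIF` is used here to turn empty strings into `NULL` inside SQL.

```python
import sqlite3

# Hypothetical sample rows (not taken from the real dataset):
# (name, nametype, class, mass, discovery, year, lat, long).
# Empty strings stand in for empty CSV fields.
SAMPLE_ROWS = [
    ("Bravo", "Valid", "L6", "21.4567", "Fell", "1900", "50.66667", "2.33333"),
    ("Alpha", "Valid", "H5", "", "Found", "1900", "-32.33333", "-64.86667"),
    ("Old Relict", "Relict", "OC", "5.0", "Found", "1850", "1.0", "1.0"),
    ("Charlie", "Valid", "OC", "107000", "Fell", "1822", "", ""),
]


def build_demo(conn):
    """Run the six steps on the sample rows and return the final table."""
    conn.execute(
        "CREATE TABLE staging (name TEXT, nametype TEXT, class TEXT, mass REAL,"
        " discovery TEXT, year INTEGER, lat REAL, long REAL)"
    )
    conn.executemany(
        "INSERT INTO staging VALUES (?, ?, ?, ?, ?, ?, ?, ?)", SAMPLE_ROWS
    )
    conn.execute(
        "CREATE TABLE meteorites (id INTEGER PRIMARY KEY, name TEXT, class TEXT,"
        " mass REAL, discovery TEXT, year INTEGER, lat REAL, long REAL)"
    )
    # Clean (NULLIF), round, filter, and sort in one statement; ids are
    # assigned by SQLite in insertion order because id is omitted.
    conn.execute(
        """
        INSERT INTO meteorites (name, class, mass, discovery, year, lat, long)
        SELECT name, class,
               ROUND(NULLIF(mass, ''), 2),
               discovery,
               NULLIF(year, ''),
               ROUND(NULLIF(lat, ''), 2),
               ROUND(NULLIF(long, ''), 2)
        FROM staging
        WHERE nametype != 'Relict'
        ORDER BY year, name
        """
    )
    return conn.execute("SELECT * FROM meteorites ORDER BY id").fetchall()


demo_rows = build_demo(sqlite3.connect(":memory:"))
for row in demo_rows:
    print(row)
```

In this sketch the "Relict" row is dropped, the 1822 row receives `id` 1, and the two rows from 1900 are numbered alphabetically by name.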

In [12]:
import sqlite3
import pandas as pd

### Database Initialization

This section opens a connection to the SQLite database file `data_bases/meteorites.db` and creates a cursor for executing SQL statements.

In [13]:
connection = sqlite3.connect("data_bases/meteorites.db")
cursor = connection.cursor()
print("Connected to the database successfully!")

Connected to the database successfully!


### Table Management
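Performing cleanup of existing structures and loading the source data:
- Removing any existing tables to ensure a fresh start
- Preventing potential conflicts with data from previous runs
- Reading `meteorites.csv` into a pandas DataFrame for staging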


In [14]:
cursor.execute('DROP TABLE IF EXISTS meteorites_temp;')
cursor.execute('DROP TABLE IF EXISTS meteorites;')
df = pd.read_csv("data_bases/meteorites.csv")

### Data Pre-processing
Setting up temporary structures for data cleaning:
1. Converting empty values to SQL `NULL`
2. Creating temporary table `meteorites_temp` for data staging
3. Ensuring data consistency between pandas `None` and SQL `NULL` values

In [15]:
# Replace any empty strings in the numeric columns with pd.NA.
# pd.read_csv already parses empty fields as NaN by default, so this is a safeguard.
df = df.replace({
    "mass": {"": pd.NA},
    "year": {"": pd.NA},
    "lat": {"": pd.NA},
    "long": {"": pd.NA}
})

# Write temp table to the database
df.to_sql("meteorites_temp", connection, index=False)

45716
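For comparison, here is a minimal sketch of the same empty-to-`NULL` staging step without pandas, using only the standard library. Unlike `pd.read_csv`, the `csv` module returns empty fields as empty strings, so the conversion has to be explicit. The two CSV rows and the helpers `empty_to_none` and `load_staging` are hypothetical; only the column layout is taken from the output shown in this notebook.

```python
import csv
import io
import sqlite3

# Hypothetical two-row excerpt using the same column layout as the staged data.
CSV_TEXT = (
    "name,id,nametype,class,mass,discovery,year,lat,long\n"
    "Angers,2301,Valid,L6,,Fell,1822,47.46667,-0.55\n"
    "Sample Stone,1,Valid,L5,21,Fell,1880,50.775,6.08333\n"
)


def empty_to_none(row):
    """Return a copy of the row with empty strings replaced by None."""
    return {key: (value if value != "" else None) for key, value in row.items()}


def load_staging(conn, csv_text):
    """Create meteorites_temp and load the CSV text into it; return the row count."""
    rows = [empty_to_none(r) for r in csv.DictReader(io.StringIO(csv_text))]
    conn.execute(
        "CREATE TABLE meteorites_temp (name TEXT, id INTEGER, nametype TEXT,"
        " class TEXT, mass REAL, discovery TEXT, year INTEGER, lat REAL, long REAL)"
    )
    # Python None is stored as SQL NULL by the sqlite3 module.
    conn.executemany(
        "INSERT INTO meteorites_temp VALUES (:name, :id, :nametype, :class,"
        " :mass, :discovery, :year, :lat, :long)",
        rows,
    )
    return len(rows)


conn = sqlite3.connect(":memory:")
loaded = load_staging(conn, CSV_TEXT)
null_mass = [
    r[0] for r in conn.execute("SELECT name FROM meteorites_temp WHERE mass IS NULL")
]
print(loaded, null_mass)
```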

In [16]:
query = """
SELECT *
FROM meteorites_temp
WHERE "mass" IS NULL
LIMIT 5
;
"""

df_0 = pd.read_sql_query(query, connection)

df_0

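|   | name              | id   | nametype | class      | mass | discovery | year   | lat       | long      |
|---|-------------------|------|----------|------------|------|-----------|--------|-----------|-----------|
| 0 | Aire-sur-la-Lys   | 425  | Valid    | Unknown    |      | Fell      | 1769.0 | 50.66667  | 2.33333   |
| 1 | Angers            | 2301 | Valid    | L6         |      | Fell      | 1822.0 | 47.46667  | -0.55     |
| 2 | Barcelona (stone) | 4944 | Valid    | OC         |      | Fell      | 1704.0 | 41.36667  | 2.16667   |
| 3 | Belville          | 5009 | Valid    | OC         |      | Fell      | 1937.0 | -32.33333 | -64.86667 |
| 4 | Castel Berardenga | 5292 | Valid    | Stone-uncl |      | Fell      | 1791.0 | 43.35     | 11.5      |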


### Schema Implementation
Creating the `meteorites` table:
- Column names follow the schema table above
- `id` is declared `INTEGER PRIMARY KEY`, so SQLite assigns it automatically when a row is inserted without one
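Because the staged `year` values arrive from pandas as floats such as `1769.0`, it may help to see how SQLite's column type affinity treats them. The sketch below uses a hypothetical `affinity_demo` table: it inserts the same values into `INTEGER`, `NUMERIC`, and `REAL` columns and reports the storage class with `typeof()`. It also shows an `INTEGER PRIMARY KEY` being filled in automatically.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE affinity_demo ("
    " id INTEGER PRIMARY KEY,"
    " as_integer INTEGER,"
    " as_numeric NUMERIC,"
    " as_real REAL)"
)

# id is omitted, so SQLite assigns 1, 2, ... in insertion order.
conn.executemany(
    "INSERT INTO affinity_demo (as_integer, as_numeric, as_real) VALUES (?, ?, ?)",
    [(1769.0, 1769.0, 1769), (1822.0, 1822.0, 1822)],
)

stored = conn.execute(
    "SELECT id, typeof(as_integer), typeof(as_numeric), typeof(as_real)"
    " FROM affinity_demo ORDER BY id"
).fetchall()
print(stored)
```

Both `INTEGER` and `NUMERIC` columns store a whole-valued float as an integer, while a `REAL` column stores an integer as a floating-point value.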

In [17]:
cursor.execute("""
CREATE TABLE meteorites (
    id INTEGER PRIMARY KEY,
    name TEXT,
    class TEXT,
    mass REAL,
    discovery TEXT,
    year INTEGER,
    lat REAL,
    long REAL
);
""")

<sqlite3.Cursor at 0x712d23b5cec0>

### Data Migration
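Transferring data from the temporary table to the final structure:
- Reading the staged rows from `meteorites_temp`
- Rounding `mass`, `lat`, and `long` to two decimal places and excluding rows whose `nametype` is "Relict"
- Inserting in `year`, then `name` order so that the automatically assigned `id` values follow the sorted order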
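Three `NULL` behaviours in SQLite affect this step, and the following sketch demonstrates them on a hypothetical table `t` with made-up rows: `ROUND` of `NULL` stays `NULL`, a `!=` comparison against `NULL` is neither true nor false (so such rows are filtered out), and `NULL` values sort first in ascending order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# ROUND propagates NULL, so missing masses or coordinates stay missing.
rounded_null = conn.execute("SELECT ROUND(NULL, 2)").fetchone()[0]

# Comparing NULL with != yields NULL, which a WHERE clause treats as not true.
comparison = conn.execute("SELECT NULL != 'Relict'").fetchone()[0]

conn.execute("CREATE TABLE t (name TEXT, nametype TEXT, year INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        ("Known", "Valid", 1900),
        ("Undated", "Valid", None),   # missing year
        ("Untyped", None, 1800),      # missing nametype
    ],
)

# Same filter and ordering as the migration query.
kept = [
    r[0]
    for r in conn.execute(
        "SELECT name FROM t WHERE nametype != 'Relict' ORDER BY year, name"
    )
]
print(rounded_null, comparison, kept)
```

Rows with an unknown year therefore receive the lowest `id` values. If rows with a missing `nametype` should be kept, one option is SQLite's null-safe form `nametype IS NOT 'Relict'`.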

In [18]:
cursor.execute("""
INSERT INTO meteorites (name, class, mass, discovery, year, lat, long)
SELECT name, class, ROUND(mass, 2), discovery, year,
       ROUND(lat, 2), ROUND(long, 2)
FROM meteorites_temp
WHERE nametype != 'Relict'
ORDER BY year, name;
""")

# Commit and close
connection.commit()
connection.close()
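After the migration, a few sanity checks can confirm that the result matches the stated goals. The helper below, `check_meteorites`, is a hypothetical sketch and not part of the original notebook. It is shown against a small in-memory table with made-up rows, but it could equally be pointed at a fresh connection to `meteorites.db`.

```python
import sqlite3


def check_meteorites(conn):
    """Return a dict of simple sanity checks on the meteorites table."""
    total, min_id, max_id = conn.execute(
        "SELECT COUNT(*), MIN(id), MAX(id) FROM meteorites"
    ).fetchone()
    # NULL values are skipped here because NULL != x is never true.
    unrounded = conn.execute(
        "SELECT COUNT(*) FROM meteorites"
        " WHERE mass != ROUND(mass, 2)"
        "    OR lat != ROUND(lat, 2)"
        "    OR long != ROUND(long, 2)"
    ).fetchone()[0]
    by_id = conn.execute("SELECT id FROM meteorites ORDER BY id").fetchall()
    by_sort = conn.execute(
        "SELECT id FROM meteorites ORDER BY year, name, id"
    ).fetchall()
    return {
        "ids_sequential": total == 0 or (min_id == 1 and max_id == total),
        "all_rounded": unrounded == 0,
        "ids_follow_sort": by_id == by_sort,
    }


# Made-up rows for demonstration only.
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE meteorites (id INTEGER PRIMARY KEY, name TEXT, class TEXT,"
    " mass REAL, discovery TEXT, year INTEGER, lat REAL, long REAL)"
)
demo.executemany(
    "INSERT INTO meteorites (name, class, mass, discovery, year, lat, long)"
    " VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("Undated", "OC", None, "Found", None, 1.5, 2.25),
        ("Alpha", "H5", 10.12, "Fell", 1800, -32.33, -64.87),
        ("Bravo", "L6", 21.46, "Fell", 1800, 50.67, 2.33),
    ],
)
report = check_meteorites(demo)
print(report)
```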