# Meteorite Landings prediction machine


* Author: Mikkel Husted
* Date: 08/09-2022
* Name: Meteorite Landings prediction machine


The purpose of this notebook is to analyze the NASA Meteorite Landings CSV dataset, which contains 45k+ records of meteorites: impacts on Earth with a geolocation, impacts observed without a geolocation, and meteorites observed with no impact discovered. The first goal of the analysis is to find a correlation between the year a meteorite fell and its geolocation, in order to estimate where a meteorite is most likely to strike in a given year. Further investigation of the dataset may reveal other correlations or patterns of interest.

Once the system has been described, it should be implemented on a server where it can be configured and tested with other settings through an API (mobile app or HTTP). The system is then required to send back a visual representation (to be defined) and a list of the most important findings.

Specification:
--------------------
The system is required to prepare the dataset for a given algorithm.
The system should show meaningful representations of the dataset, in order to gain insight.
The system should provide an RMSE/RME value for the efficiency of the algorithm.
(More to come)
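
As a rough illustration of the error value mentioned above, RMSE can be obtained from scikit-learn's `mean_squared_error` by taking the square root. The arrays below are made-up latitude values, not results from the dataset, and `mean_absolute_error` is included on the assumption that a mean-absolute-error style metric is also wanted.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up example values (degrees of latitude), for illustration only.
y_true = np.array([50.0, 56.0, 54.0, 16.0])
y_pred = np.array([52.0, 54.0, 54.0, 20.0])

# RMSE: square root of the mean squared error.
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
# MAE: mean of the absolute errors.
mae = mean_absolute_error(y_true, y_pred)

print(f"RMSE: {rmse:.3f}, MAE: {mae:.3f}")
```

RMSE penalizes large misses more heavily than MAE, so reporting both gives a hint about how much outliers dominate the error.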

End Specification:
--------------------------
The system should run on an external server (RPI?)
The server should listen for incoming HTTP connections, either through a mobile API or Webhook.
The server should save findings and/or datasets in an SQL server.
The server should pull/push data from/to the SQL server, depending on HTTP requests.
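
The storage requirement could be sketched with Python's built-in `sqlite3` module standing in for the SQL server. The table name `findings`, its columns, and the helper functions are hypothetical and chosen only for illustration; the inserted values are placeholders, not results.

```python
import sqlite3


def create_store(path=":memory:"):
    """Open a database and create the (hypothetical) findings table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS findings ("
        "id INTEGER PRIMARY KEY, year INTEGER, "
        "latitude REAL, longitude REAL, rmse REAL)"
    )
    return conn


def save_finding(conn, year, latitude, longitude, rmse):
    """Push one finding to the database using a parameterized query."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO findings (year, latitude, longitude, rmse) "
            "VALUES (?, ?, ?, ?)",
            (year, latitude, longitude, rmse),
        )


def load_findings(conn, year):
    """Pull all findings stored for a given year."""
    cur = conn.execute(
        "SELECT year, latitude, longitude, rmse FROM findings WHERE year = ?",
        (year,),
    )
    return cur.fetchall()


conn = create_store()
save_finding(conn, 2022, 55.0, 10.0, 1.5)  # placeholder values
print(load_findings(conn, 2022))
```

An HTTP handler on the server would then call `save_finding` or `load_findings` depending on the request type.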

Considerations:
-------------------------
It could maybe be combined with Particle's Argon platform to collect data from one or more sensors and feed it to the system. In that case, the requirements under Specification would have to be revisited in order to fit the system to the new dataset.


In [1]:
# This part is borrowed from github.com/ageron/handson-ml2
# Check it out for more hands-on machine learning examples
# Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)

# Scikit-Learn ≥0.20 is required
import sklearn
assert sklearn.__version__ >= "0.20"

# Import functions for making training and test sets
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedShuffleSplit
# Imports to prepare datasets
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder
# Imports for building pipelines and custom transformers
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
# Imports to transform (prepare) datasets
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import FeatureUnion
# Imports to calculate errors
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
# Imports of machine learning algorithms
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression
# Imports to assess the score of the algorithms
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint
from scipy import stats
# Import to export the system for later usage
import joblib

# Common imports
import pandas as pd
import numpy as np
import os
from pandas.plotting import scatter_matrix
import matplotlib.image as mpimg


# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# Where to save the figures
PROJECT_ROOT_DIR = "."
CHAPTER_ID = "end_to_end_project"
IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID)
os.makedirs(IMAGES_PATH, exist_ok=True)

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    path = os.path.join(IMAGES_PATH, fig_id + "." + fig_extension)
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)

In [5]:
meteorite_data = pd.read_csv("/home/kultul/notebook/datasets/meteorite/Meteorite_Landings.csv")

In [7]:
meteorite_data.head(5)
# We see that we have 10 columns with different kinds of data. The last column (GeoLocation) holds the same
# information as columns 8 & 9 (reclat and reclong), so we can drop it and rename 8 & 9.

Unnamed: 0,name,id,nametype,recclass,mass (g),fall,year,reclat,reclong,GeoLocation
0,Aachen,1,Valid,L5,21.0,Fell,1880.0,50.775,6.08333,"(50.775, 6.08333)"
1,Aarhus,2,Valid,H6,720.0,Fell,1951.0,56.18333,10.23333,"(56.18333, 10.23333)"
2,Abee,6,Valid,EH4,107000.0,Fell,1952.0,54.21667,-113.0,"(54.21667, -113.0)"
3,Acapulco,10,Valid,Acapulcoite,1914.0,Fell,1976.0,16.88333,-99.9,"(16.88333, -99.9)"
4,Achiras,370,Valid,L6,780.0,Fell,1902.0,-33.16667,-64.95,"(-33.16667, -64.95)"


In [18]:
# Copy the coordinates to clearer column names, then drop the originals and the redundant GeoLocation column
meteorite_data["latitude"] = meteorite_data["reclat"]
meteorite_data["longitude"] = meteorite_data["reclong"]
meteorite_data = meteorite_data.drop(columns=["GeoLocation", "reclong", "reclat"])
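
An equivalent way to end up with the same columns is to rename `reclat` and `reclong` with `DataFrame.rename` instead of copying them. The sketch below runs on a one-row stand-in built from the first record shown above; the frame `sample` is only for illustration.

```python
import pandas as pd

# One-row stand-in built from the first record shown by head().
sample = pd.DataFrame({
    "name": ["Aachen"],
    "reclat": [50.775],
    "reclong": [6.08333],
    "GeoLocation": ["(50.775, 6.08333)"],
})

# Rename the coordinate columns and drop the redundant GeoLocation column.
sample = (
    sample
    .rename(columns={"reclat": "latitude", "reclong": "longitude"})
    .drop(columns=["GeoLocation"])
)

print(list(sample.columns))
```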

In [19]:
meteorite_data.info()
# We see that the numeric columns are int and float, but we also have a few object (text) columns.
# Most algorithms do not accept these directly, so we will have to transform them later.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45716 entries, 0 to 45715
Data columns (total 9 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   name       45716 non-null  object 
 1   id         45716 non-null  int64  
 2   nametype   45716 non-null  object 
 3   recclass   45716 non-null  object 
 4   mass (g)   45585 non-null  float64
 5   fall       45716 non-null  object 
 6   year       45425 non-null  float64
 7   latitude   38401 non-null  float64
 8   longitude  38401 non-null  float64
dtypes: float64(4), int64(1), object(4)
memory usage: 3.1+ MB
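
The `info()` output shows both missing values (`mass (g)`, `year`, `latitude`, `longitude`) and text columns. One possible way to handle both, sketched with the scikit-learn classes already imported above, is to impute numeric columns with the median and one-hot encode a text column. The toy frame, the median strategy, and the choice of `fall` as the encoded column are assumptions for illustration, not the notebook's final pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data with the same column names as the dataset (values are illustrative).
toy = pd.DataFrame({
    "mass (g)": [21.0, 720.0, np.nan, 1914.0],
    "year": [1880.0, 1951.0, 1952.0, np.nan],
    "fall": ["Fell", "Fell", "Found", "Fell"],
})

num_cols = ["mass (g)", "year"]
cat_cols = ["fall"]

# Numeric columns: fill missing values with the median, then standardize.
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# Apply the numeric pipeline and one-hot encode the text column.
preprocess = ColumnTransformer([
    ("num", num_pipeline, num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])

prepared = preprocess.fit_transform(toy)
print(prepared.shape)
```

Fitting the transformer on the training set only and reusing it on the test set keeps the test data from leaking into the imputation and scaling statistics.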


Unnamed: 0,name,id,nametype,recclass,mass (g),fall,year,reclat,reclong,GeoLocation,latitude,longitude
0,Aachen,1,Valid,L5,21.0,Fell,1880.0,50.775,6.08333,"(50.775, 6.08333)",50.775,6.08333
1,Aarhus,2,Valid,H6,720.0,Fell,1951.0,56.18333,10.23333,"(56.18333, 10.23333)",56.18333,10.23333
2,Abee,6,Valid,EH4,107000.0,Fell,1952.0,54.21667,-113.0,"(54.21667, -113.0)",54.21667,-113.0
3,Acapulco,10,Valid,Acapulcoite,1914.0,Fell,1976.0,16.88333,-99.9,"(16.88333, -99.9)",16.88333,-99.9
4,Achiras,370,Valid,L6,780.0,Fell,1902.0,-33.16667,-64.95,"(-33.16667, -64.95)",-33.16667,-64.95


In [50]:
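# Keep the raw GeoLocation strings in their own Series. This assumes the column is still present,
# i.e. the cell runs on the freshly loaded CSV, before GeoLocation is dropped.
meteorite_location = meteorite_data["GeoLocation"].copy()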

In [51]:
meteorite_location.info()

<class 'pandas.core.series.Series'>
RangeIndex: 45716 entries, 0 to 45715
Series name: GeoLocation
Non-Null Count  Dtype 
--------------  ----- 
38401 non-null  object
dtypes: object(1)
memory usage: 357.3+ KB
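
Since `GeoLocation` holds strings such as `"(50.775, 6.08333)"`, one way to turn them into numeric columns is to strip the parentheses and split on the comma. This is a sketch on three values (two copied from the rows shown earlier plus one missing entry); the names `geo`, `parts`, and `coords` are illustrative.

```python
import pandas as pd

# Two values copied from the head() output plus one missing entry.
geo = pd.Series(
    ["(50.775, 6.08333)", "(56.18333, 10.23333)", None],
    name="GeoLocation",
)

# Remove the surrounding parentheses and split on the comma into two columns.
parts = geo.str.strip("()").str.split(",", expand=True)

# Trim whitespace and convert to numbers; missing entries become NaN.
coords = pd.DataFrame({
    "latitude": pd.to_numeric(parts[0].str.strip()),
    "longitude": pd.to_numeric(parts[1].str.strip()),
})

print(coords)
```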
