Star type prediction
In this notebook, exploratory data analysis is performed on the star dataset, and the star type is then predicted using a Random Forest classifier.
The data for this notebook has been taken from https://www.kaggle.com/deepu1109/star-dataset



Before we get into the exploratory analysis, let's have a look at the types of stars based on different categories.

Spectral class
The spectral class of a star describes its spectrum, which depends mainly on its surface temperature. The sequence runs from class "O" (hottest) to class "M" (coolest):
O > B > A > F > G > K > M
There is a good way to remember it: "O Be A Fine Gal Kiss Me" (just for fun ;) )
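
Because the spectral sequence is ordered, it can be encoded as an ordered categorical instead of as arbitrary labels. The snippet below is a minimal sketch, assuming the classes are stored as single letters; the sample values and variable names are made up for illustration and are not part of the original notebook.

```python
import pandas as pd

# Spectral sequence from hottest to coolest
SPECTRAL_ORDER = ["O", "B", "A", "F", "G", "K", "M"]

# Made-up sample values, for illustration only
sample = pd.Series(["M", "B", "A", "O", "K"])

# Ordered categorical: codes follow the physical sequence (O=0 ... M=6)
spectral = pd.Categorical(sample, categories=SPECTRAL_ORDER, ordered=True)
spectral_rank = pd.Series(spectral.codes, index=sample.index)

print(spectral_rank.tolist())
```

With this encoding, a smaller code always means a hotter class, which an alphabetical encoding does not guarantee.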



Quick question: do you know the spectral class of our Sun?
The answer is at the end of the notebook.
Size
Stars are classified by size into the following main categories:

Red Dwarf
White Dwarf
Brown Dwarf
Main Sequence
Supergiants
Hypergiants
Each of these categories occupies its own region of the Hertzsprung-Russell diagram, which plots the luminosity of stars against their temperature.

[Figure: Hertzsprung-Russell diagram]

Other common parameters that will be used in this notebook are :

Absolute Temperature (in K)
Relative Luminosity (L/Lo)
Relative Radius (R/Ro)
Absolute Magnitude (Mv)
Star Color (White, Red, Blue, Yellow, Yellow-Orange, etc.)
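
These parameters are not independent: for a star that radiates roughly like a black body, the Stefan-Boltzmann law links luminosity, radius, and temperature. The sketch below is an illustration rather than part of the original analysis. It assumes a solar effective temperature of about 5772 K, and the function name is hypothetical.

```python
T_SUN_K = 5772.0  # assumed nominal solar effective temperature, in kelvin


def relative_luminosity(relative_radius: float, temperature_k: float) -> float:
    """Black-body estimate of L/Lo from R/Ro and T: (R/Ro)^2 * (T/To)^4."""
    return relative_radius ** 2 * (temperature_k / T_SUN_K) ** 4


# A star with the Sun's radius and temperature has L/Lo = 1
print(relative_luminosity(1.0, T_SUN_K))
# Doubling the radius at the same temperature quadruples the luminosity
print(relative_luminosity(2.0, T_SUN_K))
```

This relation is one reason strong dependencies between temperature, radius, and luminosity are to be expected in the dataset.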
Now let's start the EDA on the dataset.

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
import seaborn as sns
import matplotlib.pyplot as plt

# Preprocessing, model, and evaluation tools from scikit-learn
from sklearn import metrics, preprocessing
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, auc, classification_report,
                             confusion_matrix, roc_auc_score, roc_curve)
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Data formatting

In [2]:
data = pd.read_csv("../input/star-dataset/6 class csv.csv")
data.head()

In [3]:
data.shape

In [4]:
stars = pd.read_csv("../input/star-dataset/6 class csv.csv")

fig , ax = plt.subplots(figsize = (13,10))

# Star type codes, per the dataset description:
# 0 = brown dwarf, 1 = red dwarf, 2 = white dwarf,
# 3 = main sequence, 4 = supergiant, 5 = hypergiant
B = stars[stars["Star type"] == 0]
R = stars[stars["Star type"] == 1]
W = stars[stars["Star type"] == 2]
M = stars[stars["Star type"] == 3]
S = stars[stars["Star type"] == 4]
H = stars[stars["Star type"] == 5]

# Third positional argument is the marker size: larger markers for larger stars
ax.scatter(np.log(B["Temperature (K)"]), np.log(B["Luminosity(L/Lo)"]), 5, label='Brown dwarfs')
ax.scatter(np.log(R["Temperature (K)"]), np.log(R["Luminosity(L/Lo)"]), 7, label='Red dwarfs')
ax.scatter(np.log(W["Temperature (K)"]), np.log(W["Luminosity(L/Lo)"]), 10, label='White dwarfs')
ax.scatter(np.log(M["Temperature (K)"]), np.log(M["Luminosity(L/Lo)"]), 15, label='Main sequence stars')
ax.scatter(np.log(S["Temperature (K)"]), np.log(S["Luminosity(L/Lo)"]), 30, label='Supergiants')
ax.scatter(np.log(H["Temperature (K)"]), np.log(H["Luminosity(L/Lo)"]), 50, label='Hypergiants')

ax.invert_xaxis()
ax.legend()
plt.xlabel("Log Temperature")
plt.ylabel("Log Luminosity")
ax.grid()
ax.set_facecolor("black")

Have we got well-balanced data?


In [5]:
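sns.set(style="darkgrid")
# value_counts() already sorts by count in descending order
star_type_counts = data['Star type'].value_counts()
plt.figure(figsize=(15, 5))
# Pass index and values directly so the plot does not depend on the
# column name that value_counts() produces (it differs across pandas versions)
ax = sns.barplot(x=star_type_counts.index, y=star_type_counts.values, palette='pastel')
ax.set(xlabel='Star type', ylabel='Count')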
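
The balance can also be checked numerically rather than read off the bar chart. The helper below is a hypothetical sketch; the labels are made up for illustration, and in this notebook the input would be `data['Star type']`.

```python
import pandas as pd


def class_balance(labels: pd.Series) -> pd.DataFrame:
    """Return the count and share of each class, sorted by class label."""
    counts = labels.value_counts().sort_index()
    return pd.DataFrame({"count": counts, "share": counts / counts.sum()})


# Made-up labels for illustration only
demo_labels = pd.Series([0, 0, 1, 1, 2, 2])
balance = class_balance(demo_labels)
print(balance)
```

Equal shares across all classes indicate a balanced dataset, in which case plain accuracy is a reasonable metric.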

Explore color types:

In [6]:
# value_counts() already sorts by count in descending order
star_color_counts = data['Star color'].value_counts()
plt.figure(figsize=(15, 5))
ax = sns.barplot(x=star_color_counts.index, y=star_color_counts.values, palette='pastel')
ax.set(xlabel='Star color', ylabel='Count')
# Rotate the color names so they do not overlap
ax.tick_params(axis='x', rotation=90)

In [7]:
stars_data = {
    'temperature': data['Temperature (K)'],
    'luminosity': data['Luminosity(L/Lo)'],
    'radius': data['Radius(R/Ro)'],
    'absolute_magnitude': data['Absolute magnitude(Mv)'],
    'star_type': data['Star type'],
    'star_color': data['Star color'],
    'spectral_class': data['Spectral Class']
}
stars_data = pd.DataFrame.from_dict(stars_data)
stars_data['star_type'] = stars_data['star_type'].astype('category').cat.codes
stars_data['star_color'] = stars_data['star_color'].astype('category').cat.codes
stars_data['spectral_class'] = stars_data['spectral_class'].astype('category').cat.codes

corr = stars_data.corr()
mask = np.zeros_like(corr, dtype=bool)  # np.bool was removed from recent NumPy versions
mask[np.triu_indices_from(mask)] = True
f, ax = plt.subplots(figsize=(10, 10))
cmap = sns.diverging_palette(200, 21, as_cmap=True)
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,square=True, linewidths=.5, cbar_kws={"shrink": .5})
plt.show()
corr

The strongest dependencies appear between:

temperature and radius;
star_type and star_color, spectral_class;
star_color and radius, spectral_class.

In [8]:
f, axes = plt.subplots(2, 2, figsize=(10, 10))
sns.despine(left=True)

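# histplot with kde=True replaces the deprecated distplot
sns.histplot(stars_data['temperature'], color='b', kde=True, stat='density', ax=axes[0, 0])
sns.histplot(stars_data['luminosity'], color='m', kde=True, stat='density', ax=axes[0, 1])
sns.histplot(stars_data['radius'], color='r', kde=True, stat='density', ax=axes[1, 0])
sns.histplot(stars_data['absolute_magnitude'], color='g', kde=True, stat='density', ax=axes[1, 1])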
plt.setp(axes, yticks=[])
plt.tight_layout()

In [10]:
# catplot returns a FacetGrid, so resize its underlying figure
g = sns.catplot(x='Star color', y='Temperature (K)', kind="box", data=data, palette='pastel')
g.figure.set_size_inches(30, 5)

In [11]:
g = sns.catplot(x='Star color', y='Luminosity(L/Lo)', kind="box", data=data, palette='pastel')
g.figure.set_size_inches(30, 5)

In [12]:
g = sns.catplot(x='Star color', y='Radius(R/Ro)', kind="box", data=data, palette='pastel')
g.figure.set_size_inches(30, 5)

In [13]:
g = sns.catplot(x='Star color', y='Absolute magnitude(Mv)', kind="box", data=data, palette='pastel')
g.figure.set_size_inches(30, 5)

In [14]:
import plotly.express as px

fig = px.scatter(data, x="Temperature (K)", y="Luminosity(L/Lo)", size="Radius(R/Ro)", color="Star color",
           hover_name="Star type", log_x=True, size_max=60)
fig.show()

Explore spectral classes:

Here we want to convert text columns such as Star color and Spectral Class into numbers:

In [15]:
print(data['Star color'].unique())
print(data['Spectral Class'].unique())

In [16]:
# Normalize inconsistent spellings of the same color before encoding
color_fixes = {
    'Blue white': 'Blue White', 'Blue-white': 'Blue White',
    'Blue white ': 'Blue White', 'Blue-White': 'Blue White',
    'Blue ': 'Blue', 'white': 'White', 'Whitish': 'White',
    'Yellowish White': 'Yellow White', 'yellow-white': 'Yellow White',
    'White-Yellow': 'Yellow White', 'yellowish': 'Yellow', 'Yellowish': 'Yellow',
    'Pale yellow orange': 'Yellow Orange', 'Orange-Red': 'Orange Red',
}
data['Star color'] = data['Star color'].replace(color_fixes)
# Alternative encoding: pd.factorize(data["Star color"])[0]
# Encode both text columns as integer codes
le = preprocessing.LabelEncoder()
data['Spectral Class'] = le.fit_transform(data['Spectral Class'])
data['Star color'] = le.fit_transform(data['Star color'])
data.head()

In [17]:
print(data['Star color'].unique())
print(data['Spectral Class'].unique())
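
After encoding, the printed values are bare integers, so it helps to know which label each code stands for. `LabelEncoder` assigns codes in sorted label order and keeps the labels in `classes_`. The sketch below uses made-up values and a dedicated encoder per column, which is an assumption: the cell above reuses a single `le` for both columns, so only the last fitted mapping remains available there.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up values for illustration; in the notebook this would be data['Spectral Class']
spectral_values = ["M", "B", "A", "M", "O"]

spectral_encoder = LabelEncoder()
codes = spectral_encoder.fit_transform(spectral_values)

# classes_ is sorted, and each label's code is its position in classes_
mapping = {label: int(code) for code, label in enumerate(spectral_encoder.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the codes
print(spectral_encoder.inverse_transform(codes).tolist())
```

Note that the codes follow alphabetical order, not the physical O-to-M temperature order.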

Next, we visualize the different kinds of data contained in the dataset:

In [18]:
sns.pairplot(data, hue="Star type")

The plots above show that some features are specific to the star type; for example, only a certain type of star exceeds 40,000 K.

Let's look at the correlation matrix:

In [19]:
corr_matrix = data.corr()
print(corr_matrix["Star type"])

In [21]:
columns = ["Absolute magnitude(Mv)","Radius(R/Ro)", "Luminosity(L/Lo)", "Temperature (K)", "Spectral Class"]
X = data[columns].values
y = data["Star type"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

I set aside Star color to keep only the most strongly correlated features. Spectral Class was meant to be dropped as well, but the columns list above still includes it.
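
With a small dataset, a purely random split can leave some star types under-represented in the test set. Passing `stratify` to `train_test_split` keeps the class proportions the same in both subsets. The sketch below uses made-up stand-in data; in this notebook the call would take `X` and `y` with `stratify=y`, which is a suggested variation and not what the cell above does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up, perfectly balanced stand-in data: 6 classes x 10 samples, 2 features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 2))
y_demo = np.repeat(np.arange(6), 10)

# stratify=y_demo keeps the class proportions equal in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)

# Number of test samples per class
print(np.bincount(y_te))
```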

# Study: predictions

### 1. Logistic Regression

In [22]:
lr = LogisticRegression()
lr.fit(X_train,y_train)

In [23]:
y_lr = lr.predict(X_test)
print(accuracy_score(y_test,y_lr))

We quickly see that this result is poor.
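
One plausible reason, stated here as an assumption rather than a verified diagnosis, is that the features live on very different scales, which makes it hard for the solver to converge. Standardizing the features inside a pipeline is a common remedy. The sketch below uses synthetic stand-in data; the variable names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with features on very different scales (illustration only)
rng = np.random.default_rng(0)
y_demo = np.repeat(np.arange(3), 40)
X_demo = np.column_stack([
    rng.normal(loc=3000.0 + 5000.0 * y_demo, scale=500.0),  # temperature-like
    rng.normal(loc=0.1 * (y_demo + 1), scale=0.02),          # radius-like
])

# Standardize each feature before fitting; max_iter is raised to help convergence
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(X_demo, y_demo)

# Accuracy on the data the model was fitted on
demo_score = scaled_lr.score(X_demo, y_demo)
print(demo_score)
```

In this notebook, the same pipeline could be fitted with `scaled_lr.fit(X_train, y_train)` and scored on `X_test`, `y_test` to see whether scaling changes the result.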

### 2. Random Forest

In [24]:
rfc = RandomForestClassifier(n_estimators=100)

rfc.fit(X_train,y_train)

In [25]:
rfc_preds = rfc.predict(X_test)

print("Accuracy:", metrics.accuracy_score(y_test, rfc_preds))

We obtain 100% accuracy, which shows how strongly the choice of method can affect the results.
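
A perfect score on a single small test split deserves a second look, because it can depend on which samples happened to land in that split. Cross-validation gives one score per fold and therefore a rough idea of the variability. The sketch below runs on synthetic stand-in data; in this notebook `X` and `y` would be passed instead.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (illustration only): 3 classes x 30 samples, 2 features
rng = np.random.default_rng(0)
y_demo = np.repeat(np.arange(3), 30)
X_demo = rng.normal(loc=y_demo[:, None] * 3.0, scale=1.0, size=(90, 2))

# 5-fold stratified cross-validation yields one accuracy value per fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
rf = RandomForestClassifier(n_estimators=100, random_state=1)
scores = cross_val_score(rf, X_demo, y_demo, cv=cv, scoring="accuracy")

print(scores.mean(), scores.std())
```

Fixing `random_state` on the forest also makes the reported accuracy reproducible from run to run.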

### 3. SVC (Support Vector Classifier)

In [26]:
svc = SVC(kernel='linear')

# Train the model
svc.fit(X_train,y_train)

In [27]:
svc_preds = svc.predict(X_test)

print("Accuracy:", metrics.accuracy_score(y_test, svc_preds))

### 4. K-Nearest Neighbour Classifier

In [28]:
knc = KNeighborsClassifier()
knc.fit(X_train,y_train)

In [29]:
knc_preds = knc.predict(X_test)

print("Accuracy:", metrics.accuracy_score(y_test, knc_preds))

In [30]:
knc_preds
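
Reading a raw array of predictions does not show where the classifier goes wrong. A confusion matrix and a per-class report, both already imported at the top of the notebook, make the errors visible per star type. The labels below are made up for illustration; in this notebook they would be `y_test` and `knc_preds`.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Made-up true and predicted labels, for illustration only
y_true_demo = [0, 0, 1, 1, 2, 2]
y_pred_demo = [0, 1, 1, 1, 2, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_demo, y_pred_demo)
print(cm)

# Precision, recall, and F1 score for each class
print(classification_report(y_true_demo, y_pred_demo))
```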

In [42]:
from catboost import CatBoostClassifier

# Gradient-boosted trees with a multi-class objective
model = CatBoostClassifier(iterations=100,
                           depth=3,
                           learning_rate=0.5,
                           loss_function='MultiClass',
                           verbose=True)
# Train the model
model.fit(X_train, y_train)
# Predict class labels and class probabilities with the fitted model
preds_class = model.predict(X_test)
preds_proba = model.predict_proba(X_test)

print("Accuracy:", metrics.accuracy_score(y_test, preds_class))


In [None]:
preds_class