## MAGIC Gamma Telescope

In [23]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### Dataset:

Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

Donated by:
P. Savicky
Institute of Computer Science, AS of CR
Czech Republic
savicky '@' cs.cas.cz


In [24]:
df = pd.read_csv('magic04.data', header=None)
df.columns = ['fLength', 'fWidth', 'fSize', 'fConc', 'fConc1',
              'fAsym', 'fM3Long', 'fM3Trans', 'fAlpha', 'fDist', 'class']
df.head()

| | fLength | fWidth | fSize | fConc | fConc1 | fAsym | fM3Long | fM3Trans | fAlpha | fDist | class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 28.7967 | 16.0021 | 2.6449 | 0.3918 | 0.1982 | 27.7004 | 22.011 | -8.2027 | 40.092 | 81.8828 | g |
| 1 | 31.6036 | 11.7235 | 2.5185 | 0.5303 | 0.3773 | 26.2722 | 23.8238 | -9.9574 | 6.3609 | 205.261 | g |
| 2 | 162.052 | 136.031 | 4.0612 | 0.0374 | 0.0187 | 116.741 | -64.858 | -45.216 | 76.96 | 256.788 | g |
| 3 | 23.8172 | 9.5728 | 2.3385 | 0.6147 | 0.3922 | 27.2107 | -6.4633 | -7.1513 | 10.449 | 116.737 | g |
| 4 | 75.1362 | 30.9205 | 3.1611 | 0.3168 | 0.1832 | -5.5277 | 28.5525 | 21.8393 | 4.648 | 356.462 | g |


In [25]:
df['class'] = df['class'].map({'g': 1, 'h': 0})
df.head()

| | fLength | fWidth | fSize | fConc | fConc1 | fAsym | fM3Long | fM3Trans | fAlpha | fDist | class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 28.7967 | 16.0021 | 2.6449 | 0.3918 | 0.1982 | 27.7004 | 22.011 | -8.2027 | 40.092 | 81.8828 | 1 |
| 1 | 31.6036 | 11.7235 | 2.5185 | 0.5303 | 0.3773 | 26.2722 | 23.8238 | -9.9574 | 6.3609 | 205.261 | 1 |
| 2 | 162.052 | 136.031 | 4.0612 | 0.0374 | 0.0187 | 116.741 | -64.858 | -45.216 | 76.96 | 256.788 | 1 |
| 3 | 23.8172 | 9.5728 | 2.3385 | 0.6147 | 0.3922 | 27.2107 | -6.4633 | -7.1513 | 10.449 | 116.737 | 1 |
| 4 | 75.1362 | 30.9205 | 3.1611 | 0.3168 | 0.1832 | -5.5277 | 28.5525 | 21.8393 | 4.648 | 356.462 | 1 |


In [26]:
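# Shuffle all rows, then cut at 60% and 80% of the length:
# the first 60% is train, the next 20% validate, the last 20% test.
train, validate, test = np.split(
    df.sample(frac=1), [int(.6*len(df)), int(.8*len(df))])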

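
To make the split above easier to follow, here is a minimal sketch of how `np.split` behaves with two cut points. It uses a toy array of 10 row indices rather than the real DataFrame; the variable names are illustrative only.

```python
import numpy as np

# Toy stand-in for the shuffled rows: 10 row indices.
n_rows = 10
rows = np.arange(n_rows)

# Two cut points give three pieces: [0, 6), [6, 8), [8, 10).
cut_points = [int(.6 * n_rows), int(.8 * n_rows)]
train_part, validate_part, test_part = np.split(rows, cut_points)

print(len(train_part), len(validate_part), len(test_part))  # 6 2 2
```

The cut points are positions, not sizes, which is why the second one is 80% rather than 20%.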


In [27]:
# !conda install -c conda-forge imbalanced-learn --yes

In [28]:
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import RandomOverSampler

In [29]:
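def scale_data(data, overSample=False):
    # Separate the feature columns from the label column.
    X = data.drop('class', axis=1)
    y = data['class']

    # Standardize each feature to zero mean and unit variance.
    # Note: the scaler is fit on whichever split is passed in.
    scaler = StandardScaler()
    X = scaler.fit_transform(X)

    # Optionally resample the minority class (with replacement)
    # until both classes have the same number of rows.
    if overSample:
        ros = RandomOverSampler()
        X, y = ros.fit_resample(X, y)

    # Re-attach the labels as the last column of one NumPy array.
    data = np.hstack((X, np.array(y).reshape(-1, 1)))

    return data, X, y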
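
`scale_data` fits a new `StandardScaler` on each split it receives, so the validation and test sets end up standardized with their own means and variances rather than those of the training set. A common alternative is to compute the statistics on the training split only and reuse them everywhere. The sketch below illustrates that idea with plain NumPy on synthetic data; `fit_standardizer` and `apply_standardizer` are hypothetical helper names, not part of this notebook. With scikit-learn the same effect comes from calling `fit` on the training data and `transform` on the other splits.

```python
import numpy as np

def fit_standardizer(X_train):
    """Compute per-feature mean and standard deviation on training data only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # guard against constant features
    return mean, std

def apply_standardizer(X, mean, std):
    """Standardize any split using statistics learned from the training split."""
    return (X - mean) / std

# Synthetic stand-ins for the real splits (assumption: 3 numeric features).
rng = np.random.default_rng(0)
X_tr = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
X_te = rng.normal(loc=50.0, scale=10.0, size=(50, 3))

mean, std = fit_standardizer(X_tr)
X_tr_scaled = apply_standardizer(X_tr, mean, std)
X_te_scaled = apply_standardizer(X_te, mean, std)  # reuses the training statistics
```

The training split ends up with exactly zero mean and unit variance per feature; the test split is only approximately standardized, which is expected.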

In [30]:
print(len(train[train['class'] == 1]), len(train[train['class'] == 0]))

7391 4021


In [31]:
train, X_train, y_train = scale_data(train, overSample=True)

In [32]:
print(len(train[train[:, -1] == 1]), len(train[train[:, -1] == 0]))

7391 7391
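
The counts go from 7391 / 4021 to 7391 / 7391 because random oversampling duplicates minority-class rows, drawn with replacement, until the classes are the same size. The following is a simplified illustration of that idea in NumPy, not imbalanced-learn's implementation; `random_oversample` is a hypothetical helper and the data is a toy example.

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority rows until every class matches the largest."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()

    keep = [np.arange(len(y))]  # start with every original row
    for cls, count in zip(classes, counts):
        if count < target:
            idx = np.flatnonzero(y == cls)
            # Draw the missing rows with replacement from this class.
            keep.append(rng.choice(idx, size=target - count, replace=True))
    keep = np.concatenate(keep)
    return X[keep], y[keep]

# Toy data: four rows of class 1, two rows of class 0.
X_demo = np.arange(12, dtype=float).reshape(6, 2)
y_demo = np.array([1, 1, 1, 1, 0, 0])

X_bal, y_bal = random_oversample(X_demo, y_demo)
print(np.bincount(y_bal))  # [4 4]
```

Only the training split is oversampled in this notebook; the validation and test splits keep their original class balance.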


In [33]:
validate, X_validate, y_validate = scale_data(validate)
test, X_test, y_test = scale_data(test)

## k-nearest neighbors

In [34]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

In [35]:
knn_model = KNeighborsClassifier(n_neighbors=1)
knn_model.fit(X_train, y_train)
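
For intuition, a k-nearest-neighbors prediction can be sketched in a few lines of NumPy: measure the distance from each query point to every training point, take the `k` closest, and let them vote. This is a brute-force illustration assuming Euclidean distance, uniform votes, and binary 0/1 labels; `knn_predict` is a hypothetical helper, and scikit-learn's classifier uses more efficient neighbor search.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k=1):
    """Brute-force k-NN for binary 0/1 labels with Euclidean distance."""
    # Pairwise distances, shape (n_query, n_train).
    diffs = X_query[:, None, :] - X_train[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))

    # Indices of the k closest training points for each query.
    nearest = np.argsort(dists, axis=1)[:, :k]
    votes = y_train[nearest]  # shape (n_query, k)

    # Majority vote; in this sketch a tie goes to class 1.
    return (votes.mean(axis=1) >= 0.5).astype(int)

# Toy data: two points near the origin (class 0), two near (5, 5) (class 1).
X_tr = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
y_tr = np.array([0, 0, 1, 1])
X_q = np.array([[0.2, 0.1], [5.2, 5.1]])

print(knn_predict(X_tr, y_tr, X_q, k=1))  # [0 1]
```

With `k=1`, as in the cell above, each prediction is simply the label of the single closest training point.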

In [36]:
y_pred = knn_model.predict(X_test)

In [37]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.77      0.68      0.72      1319
           1       0.84      0.89      0.87      2485

    accuracy                           0.82      3804
   macro avg       0.81      0.79      0.79      3804
weighted avg       0.82      0.82      0.82      3804
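
Some of the figures in the report can be re-derived from the per-class rows. The check below is a small sketch that uses the rounded values printed above, so a recomputed number may differ from the report in the last digit; `f1_from` is a hypothetical helper name.

```python
def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Class 0 precision and recall, copied from the report (already rounded).
f1_class0 = f1_from(0.77, 0.68)

# Weighted average recall: each class's recall weighted by its support.
support = {0: 1319, 1: 2485}
recall = {0: 0.68, 1: 0.89}
total = sum(support.values())
weighted_recall = sum(recall[c] * support[c] for c in support) / total

print(round(f1_class0, 2))        # 0.72
print(round(weighted_recall, 2))  # 0.82
```

The support-weighted recall is the fraction of all test rows classified correctly, which is why it matches the accuracy of 0.82.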



## Naive Bayes

Naive Bayes is built on Bayes' theorem:

$ P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)} $

Here are the names for each term:

- $ P(A|B) $: **Posterior Probability** - the probability of event $A$ occurring given that event $B$ has occurred.
- $ P(B|A) $: **Likelihood** - the probability of event $B$ occurring given that event $A$ has occurred.
- $ P(A) $: **Prior Probability** - the probability of event $A$ before any information about event $B$ is taken into account.
- $ P(B) $: **Marginal Probability** or **Evidence** - the overall probability of event $B$, regardless of whether $A$ occurs.

In the context of Naive Bayes classification:

- $ A $: The class label.
- $ B $: The feature vector or evidence.

So, in terms of classification:

- $ P(A|B) $: The probability of the class $A$ given the feature vector $B$.
- $ P(B|A) $: The probability of the feature vector $B$ given the class $A$.
- $ P(A) $: The prior probability of the class $A$.
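- $ P(B) $: The marginal probability (evidence) of the feature vector $B$. It is the same for every class, so it does not change which class has the highest posterior.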
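
The "naive" part is an extra assumption layered on Bayes' theorem: given the class, the individual features are treated as independent of one another. Writing the feature vector as $B = (b_1, \dots, b_n)$ (notation introduced here for illustration; $n = 10$ for the features `fLength` through `fDist`), the likelihood factorizes as:

$ P(B|A) = \prod_{i=1}^{n} P(b_i|A) $

Since $P(B)$ does not depend on the class, the posterior is proportional to the prior times this product:

$ P(A|B) \propto P(A) \prod_{i=1}^{n} P(b_i|A) $

and the predicted class is the one that maximizes the right-hand side:

$ \hat{A} = \arg\max_{A} \; P(A) \prod_{i=1}^{n} P(b_i|A) $

In the Gaussian variant, each $P(b_i|A)$ is modeled as a normal density with a per-class, per-feature mean and variance estimated from the training data.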


In [38]:
from sklearn.naive_bayes import GaussianNB

In [39]:
nb_model = GaussianNB()
nb_model.fit(X_train, y_train)
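
To connect the formula to the model fitted above, here is a minimal from-scratch sketch of Gaussian Naive Bayes in NumPy. It is an illustration under stated assumptions, not scikit-learn's implementation: `fit_gaussian_nb` and `predict_gaussian_nb` are hypothetical helpers, the data is synthetic, and the small constant added to the variances is a simple stability choice made for this sketch.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate the class priors and per-class, per-feature mean and variance."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    # Small constant keeps variances strictly positive (assumption of this sketch).
    variances = np.array([X[y == c].var(axis=0) for c in classes]) + 1e-9
    return classes, priors, means, variances

def predict_gaussian_nb(model, X):
    """Pick the class maximizing log P(A) + sum_i log P(b_i | A)."""
    classes, priors, means, variances = model
    # Log of the normal density per sample, class and feature: shape (n, c, d).
    log_density = -0.5 * (
        np.log(2 * np.pi * variances)[None, :, :]
        + (X[:, None, :] - means[None, :, :]) ** 2 / variances[None, :, :]
    )
    # Summing logs over features is the product in the naive factorization.
    log_posterior = np.log(priors)[None, :] + log_density.sum(axis=2)
    return classes[np.argmax(log_posterior, axis=1)]

# Synthetic data: class 0 centered at (0, 0), class 1 centered at (5, 5).
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0.0, 1.0, (100, 2)), rng.normal(5.0, 1.0, (100, 2))])
y_demo = np.concatenate([np.zeros(100, dtype=int), np.ones(100, dtype=int)])

model = fit_gaussian_nb(X_demo, y_demo)
print(predict_gaussian_nb(model, np.array([[0.0, 0.0], [5.0, 5.0]])))  # [0 1]
```

Working in log space avoids multiplying many small probabilities together, which would otherwise underflow when there are many features.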

In [40]:
y_pred_nb = nb_model.predict(X_test)
print(classification_report(y_test, y_pred_nb))

              precision    recall  f1-score   support

           0       0.68      0.39      0.50      1319
           1       0.74      0.90      0.81      2485

    accuracy                           0.73      3804
   macro avg       0.71      0.65      0.65      3804
weighted avg       0.72      0.73      0.70      3804

