Build a model that predicts the values in the RainTomorrow column (a classification task) from the given table of weather observations in Australia.

Use the last two months of weather observations across all Australian locations as the test data.
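
A hold-out of this kind can be sketched by cutting on the Date column itself. The synthetic frame and the helper name `split_last_months` below are illustrative assumptions, not part of the original notebook:

```python
import pandas as pd

def split_last_months(df, date_col="Date", months=2):
    """Hold out the final `months` months of observations as the test set."""
    # The cut-off is measured back from the latest date present in the data
    cutoff = df[date_col].max() - pd.DateOffset(months=months)
    test_mask = df[date_col] > cutoff
    return df[~test_mask], df[test_mask]

# Synthetic daily observations for two stations (illustrative data only)
dates = pd.date_range("2017-01-01", "2017-06-30", freq="D")
toy = pd.DataFrame({
    "Date": list(dates) * 2,
    "Location": ["Cairns"] * len(dates) + ["Perth"] * len(dates),
})

train, test = split_last_months(toy)
print(train["Date"].max(), test["Date"].min())
```

Cutting on the actual date avoids separate year and month comparisons, so the same helper also works when the last two months straddle a year boundary.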

It will most likely be necessary to build separate models for different observation locations, because the climate differs between them under the influence of the various coastal zones.
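
A minimal sketch of the per-location idea, assuming a frame with a `Location` column and a Yes/No `RainTomorrow` target. The helper `fit_per_location`, the toy data and the tree settings are assumptions made for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

def fit_per_location(df, feature_cols, target_col="RainTomorrow", group_col="Location"):
    """Fit one small decision tree per station and return them keyed by station name."""
    models = {}
    for location, part in df.groupby(group_col):
        clf = DecisionTreeClassifier(max_depth=3, random_state=0)
        clf.fit(part[feature_cols], part[target_col])
        models[location] = clf
    return models

# Synthetic data for two stations (illustrative only)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Location": ["Cairns"] * 50 + ["Uluru"] * 50,
    "Humidity3pm": rng.uniform(10, 100, size=100),
    "Pressure3pm": rng.uniform(990, 1030, size=100),
})
# Toy rule: rain follows humid afternoons
toy["RainTomorrow"] = np.where(toy["Humidity3pm"] > 70, "Yes", "No")

feature_cols = ["Humidity3pm", "Pressure3pm"]
models = fit_per_location(toy, feature_cols)
print(sorted(models))
```

Each station gets its own estimator, so predictions for a new row are made with `models[row_location]`.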

**Recommendations**:

1. It is probably best to start with a simple model on the numeric observations, after checking the data for missing values and dropping the days that have gaps in the numeric observations.

2. Dates can be turned into several new columns (Day, Month, Day of week, Day of month, Day of year).

3. To improve model quality, use the categorical features by converting them into dummy variables.
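
The three recommendations can be sketched on a tiny synthetic frame. The rows and the new column names (`Month`, `DayOfWeek`, `DayOfMonth`, `DayOfYear`) are assumptions chosen for illustration:

```python
import numpy as np
import pandas as pd

# Synthetic rows that mimic a few columns of the weather table (illustrative only)
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2017-06-22", "2017-06-23", "2017-06-24", "2017-06-25"]),
    "MinTemp": [12.5, np.nan, 9.8, 11.0],
    "WindGustDir": ["N", "SE", "SE", "N"],
})

# 1. Drop days with gaps in the numeric observations
numeric_cols = toy.select_dtypes(include="number").columns
clean = toy.dropna(subset=numeric_cols).copy()

# 2. Expand the date into calendar features
clean["Month"] = clean["Date"].dt.month
clean["DayOfWeek"] = clean["Date"].dt.dayofweek  # Monday = 0
clean["DayOfMonth"] = clean["Date"].dt.day
clean["DayOfYear"] = clean["Date"].dt.dayofyear
clean = clean.drop(columns="Date")

# 3. Turn the categorical column into dummy variables
features = pd.get_dummies(clean, columns=["WindGustDir"])
print(features.columns.tolist())
```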

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

from sklearn.metrics import r2_score, f1_score, confusion_matrix, classification_report
from sklearn import tree
from sklearn import ensemble
from sklearn.model_selection import GridSearchCV, train_test_split

- Date: The date of observation 

- Location: The common name of the location of the weather station

- MinTemp: The minimum temperature in degrees Celsius

- MaxTemp: The maximum temperature in degrees Celsius

- Rainfall: The amount of rainfall recorded for the day in mm

- Evaporation: The so-called Class A pan evaporation (mm) in the 24 hours to 9am

- Sunshine: The number of hours of bright sunshine in the day.

- WindGustDir: The direction of the strongest wind gust in the 24 hours to midnight

- WindGustSpeed: The speed (km/h) of the strongest wind gust in the 24 hours to midnight

- WindDir9am: Direction of the wind at 9am

- WindDir3pm: Direction of the wind at 3pm

- WindSpeed9am: Wind speed (km/hr) averaged over 10 minutes prior to 9am

- WindSpeed3pm: Wind speed (km/hr) averaged over 10 minutes prior to 3pm

- Humidity9am: Humidity (percent) at 9am

- Humidity3pm: Humidity (percent) at 3pm

- Pressure9am: Atmospheric pressure (hPa) reduced to mean sea level at 9am

- Pressure3pm: Atmospheric pressure (hPa) reduced to mean sea level at 3pm

- Cloud9am: Fraction of sky obscured by cloud at 9am. This is measured in "oktas", which are a unit of eighths. It records how many eighths of the sky are obscured by cloud.

- Cloud3pm: Fraction of sky obscured by cloud (in "oktas": eighths) at 3pm. See Cloud9am for a description of the values

- Temp9am: Temperature (degrees C) at 9am

- Temp3pm: Temperature (degrees C) at 3pm

- RainToday: Boolean: 1 if precipitation (mm) in the 24 hours to 9am exceeds 1mm, otherwise 0

- RISK_MM: The amount of next day rain in mm. Used to create the response variable RainTomorrow. A kind of measure of the "risk".

- RainTomorrow: The response variable: whether it rains the next day (Yes/No)

In [None]:
# Load the data
weather_AUS = pd.read_csv("https://raw.githubusercontent.com/qwerty29544/RpracticeBook/master/2Data/01FlatTables/weatherAUS.csv",
                          parse_dates = ["Date"]).drop(["RISK_MM", 'Evaporation', 'Sunshine'], axis=1)

In [None]:
names = weather_AUS.columns
weather_AUS.isnull().sum()

Date                 0
Location             0
MinTemp           3546
MaxTemp           3359
Rainfall          6362
WindGustDir      16330
WindGustSpeed    16223
WindDir9am       17040
WindDir3pm        9238
WindSpeed9am      4160
WindSpeed3pm      7731
Humidity9am       4849
Humidity3pm       8931
Pressure9am      24024
Pressure3pm      24005
Cloud9am         93670
Cloud3pm         99480
Temp9am           3624
Temp3pm           7783
RainToday         6362
RainTomorrow      6361
dtype: int64

In [None]:
set(weather_AUS['Location'])

{'Adelaide',
 'Albany',
 'Albury',
 'AliceSprings',
 'BadgerysCreek',
 'Ballarat',
 'Bendigo',
 'Brisbane',
 'Cairns',
 'Canberra',
 'Cobar',
 'CoffsHarbour',
 'Dartmoor',
 'Darwin',
 'GoldCoast',
 'Hobart',
 'Katherine',
 'Launceston',
 'Melbourne',
 'MelbourneAirport',
 'Mildura',
 'Moree',
 'MountGambier',
 'MountGinini',
 'Newcastle',
 'Nhil',
 'NorahHead',
 'NorfolkIsland',
 'Nuriootpa',
 'PearceRAAF',
 'Penrith',
 'Perth',
 'PerthAirport',
 'Portland',
 'Richmond',
 'Sale',
 'SalmonGums',
 'Sydney',
 'SydneyAirport',
 'Townsville',
 'Tuggeranong',
 'Uluru',
 'WaggaWagga',
 'Walpole',
 'Watsonia',
 'Williamtown',
 'Witchcliffe',
 'Wollongong',
 'Woomera'}

In [None]:
# Fill the gaps: numeric columns get the column median, categorical ones the next valid value
categorical = ['WindGustDir', 'WindDir9am', 'RainToday', 'RainTomorrow', 'WindDir3pm']
for i in names[2:]:
  if i not in categorical:
    weather_AUS[i] = weather_AUS[i].fillna(weather_AUS[i].median())
  else:
    weather_AUS[i] = weather_AUS[i].bfill()
weather_AUS.Date = pd.to_datetime(weather_AUS.Date)
# Split the date into calendar features ('Mounth' is the column name used by the cells below)
weather_AUS['Year'] = weather_AUS['Date'].dt.year
weather_AUS['Mounth'] = weather_AUS['Date'].dt.month
weather_AUS['Day'] = weather_AUS['Date'].dt.day
weather_AUS = weather_AUS.drop('Date', axis=1).reset_index(drop=True)

In [None]:
def splitTrainTest(weather_AUS):
  # Work on a re-indexed copy so the caller's frame is not modified
  data = weather_AUS.reset_index(drop=True)
  # Consecutive month number, so the cut-off also works across a year boundary
  month_idx = data['Year'] * 12 + data['Mounth']
  # Test set: the last two calendar months, counted back from the final row
  is_test = month_idx > month_idx.iloc[-1] - 2
  X_test = data[is_test]
  # Train set: every earlier observation, in all months of all years
  X_train = data[~is_test]
  Y_train = X_train['RainTomorrow']
  Y_test = X_test['RainTomorrow']
  X_test = X_test.drop('RainTomorrow', axis=1)
  X_train = X_train.drop('RainTomorrow', axis=1)
  return X_train, X_test, Y_train, Y_test

In [None]:
d = dict(tuple(weather_AUS.groupby('Location')))

In [None]:
X_train, X_test, Y_train, Y_test = splitTrainTest(d['Cairns'])
# One-hot encode the categorical columns. get_dummies on the target produces one indicator
# column per class (No/Yes), so the models below are fit on a multilabel target.
X_test, X_train, Y_test, Y_train = pd.get_dummies(X_test), pd.get_dummies(X_train), pd.get_dummies(Y_test),  pd.get_dummies(Y_train)
# Keep only the columns present in both sets so the feature matrices line up
common_cols = [element for element in X_train.columns if element in X_test.columns]
X_train, X_test = X_train[common_cols], X_test[common_cols]
# Downcast float64 columns to float32
X_train = X_train.apply(pd.to_numeric, downcast='float')
X_test = X_test.apply(pd.to_numeric, downcast='float')

cols = Y_train.select_dtypes(exclude=['float']).columns
Y_train[cols] = Y_train[cols].apply(pd.to_numeric, downcast='float', errors='coerce')

cols = Y_test.select_dtypes(exclude=['float']).columns
Y_test[cols] = Y_test[cols].apply(pd.to_numeric, downcast='float', errors='coerce')
X_test.shape, X_train.shape

((2867, 114), (108343, 114))

In [None]:
X_train.dtypes

MinTemp            float32
MaxTemp            float32
Rainfall           float32
WindGustSpeed      float32
WindSpeed9am       float32
WindSpeed3pm       float32
Humidity9am        float32
Humidity3pm        float32
Pressure9am        float32
Pressure3pm        float32
Cloud9am           float32
Cloud3pm           float32
Temp9am            float32
Temp3pm            float32
Year               float32
Mounth             float32
Day                float32
Location_Uluru       uint8
WindGustDir_E        uint8
WindGustDir_ENE      uint8
WindGustDir_ESE      uint8
WindGustDir_N        uint8
WindGustDir_NNW      uint8
WindGustDir_NW       uint8
WindGustDir_S        uint8
WindGustDir_SE       uint8
WindGustDir_SSE      uint8
WindGustDir_SSW      uint8
WindGustDir_W        uint8
WindGustDir_WNW      uint8
WindDir9am_E         uint8
WindDir9am_ENE       uint8
WindDir9am_ESE       uint8
WindDir9am_N         uint8
WindDir9am_NNW       uint8
WindDir9am_NW        uint8
WindDir9am_S         uint8
W

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

def train_using_gini(X_train, y_train):
    # Creating the classifier object
    clf_gini = tree.DecisionTreeClassifier(criterion = "gini",
            random_state = 100,max_depth=3, min_samples_leaf=5)
    clf_gini.fit(X_train, y_train)
    return clf_gini

def prediction(X_test, clf_object):
    return clf_object.predict(X_test)
      
# Function to calculate accuracy
def cal_accuracy(y_test, y_pred):
    print (accuracy_score(y_test,y_pred)*100)
    print(classification_report(y_test, y_pred))

In [None]:
clf_gini = train_using_gini(X_train, Y_train)
print("Results Using Gini Index:")
y_pred_gini = prediction(X_test, clf_gini)
cal_accuracy(Y_test, y_pred_gini), f1_score(Y_test, y_pred_gini, average="macro")

Results Using Gini Index:
79.14196023718172
              precision    recall  f1-score   support

           0       0.80      0.93      0.86      2006
           1       0.74      0.46      0.57       861

   micro avg       0.79      0.79      0.79      2867
   macro avg       0.77      0.70      0.72      2867
weighted avg       0.78      0.79      0.78      2867
 samples avg       0.79      0.79      0.79      2867



(None, 0.717165468328503)
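
The "micro avg" and "samples avg" rows in the report above are what scikit-learn prints for a multilabel target, which matches the target having been passed through `pd.get_dummies`. A hedged alternative, shown on synthetic data with assumed names, is to encode the target as a single 0/1 column so that the task is treated as ordinary binary classification:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report, f1_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the weather features and the Yes/No target (illustrative only)
rng = np.random.default_rng(0)
X = pd.DataFrame({"Humidity3pm": rng.uniform(10, 100, size=200)})
y_raw = pd.Series(np.where(X["Humidity3pm"] > 70, "Yes", "No"))

# Encode the target as one 0/1 column instead of two indicator columns
y = y_raw.map({"No": 0, "Yes": 1})

# Simple positional split: first 150 rows for training, the rest for testing
X_tr, X_te = X.iloc[:150], X.iloc[150:]
y_tr, y_te = y.iloc[:150], y.iloc[150:]

clf = DecisionTreeClassifier(max_depth=3, random_state=100)
clf.fit(X_tr, y_tr)
y_pred = clf.predict(X_te)

report = classification_report(y_te, y_pred)
print(report)
print(f1_score(y_te, y_pred, average="macro"))
```

With a one-dimensional target the report shows an overall accuracy row instead of the micro and samples averages.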

In [None]:
model_DT_gscv = tree.DecisionTreeClassifier()
params_grid = {
    "ccp_alpha": [0.0],
    "class_weight": [None],
    "criterion": ['gini'],
    "max_depth": [12, 15, 18],
    "max_features": [None],
    "max_leaf_nodes": [None],
    "min_impurity_decrease": [0.0],
    "min_samples_leaf": [1, 5, 10],
    "min_samples_split": [2, 12, 22],
    'min_weight_fraction_leaf': [0.0], 
    'random_state': [None], 
    'splitter': ['best']
    }


grid_search_DT_clf = GridSearchCV(estimator=model_DT_gscv, 
                                  param_grid=params_grid, 
                                  scoring="f1_macro", 
                                  cv = 4)
grid_search_DT_clf.fit(np.array(X_train), np.array(Y_train))
preds_train = grid_search_DT_clf.predict(np.array(X_train))
print("F1 score on the training set ", f1_score(Y_train, preds_train, average="macro"))
preds_DT_gscv = grid_search_DT_clf.predict(np.array(X_test))
f1_score(Y_test, preds_DT_gscv, average="macro")

F1 score on the training set  0.7979903137707194


0.7505178760341387

In [None]:
model_RF_clf = ensemble.RandomForestClassifier()
model_RF_clf.fit(np.array(X_train), np.array(Y_train))
preds_RF_clf1 =  model_RF_clf.predict(np.array(X_train))
preds_RF_clf = model_RF_clf.predict(np.array(X_test))

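# Score the random forest on its own test predictions (preds_RF_clf). The test value recorded
# below equals the decision tree score because it was computed from preds_DT_gscv.
f1_score(Y_train, preds_RF_clf1, average="macro"), f1_score(Y_test, preds_RF_clf, average="macro")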

(1.0, 0.7505178760341387)

In [None]:
# Class distribution
weather_AUS["RainTomorrow"].value_counts(normalize=True)

No     0.774904
Yes    0.225096
Name: RainTomorrow, dtype: float64
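
With roughly 77% of days labelled "No", a model that always predicts "No" already reaches about that accuracy, which is one reason to judge the classifiers by macro F1. A minimal baseline sketch on synthetic labels with similar class shares (the label counts are an assumption for illustration):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

# Synthetic labels with roughly the class shares shown above (illustrative only)
y = np.array(["No"] * 775 + ["Yes"] * 225)
X = np.zeros((len(y), 1))  # features are ignored by the baseline

# Always predict the most frequent class
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)
y_pred = baseline.predict(X)

acc = accuracy_score(y, y_pred)
# zero_division=0 scores the never-predicted "Yes" class as 0 without a warning
f1_macro = f1_score(y, y_pred, average="macro", zero_division=0)
print(acc, f1_macro)
```

On these labels the baseline has an accuracy of 0.775 but a macro F1 of only about 0.44, so any useful model should clearly beat the latter.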