### Notes from the mentor call

Some clusters of `next_reimage` may be dominant, which gives insight: look for groups with a higher proportion of upcoming reimages.

Chi-square test: find a category that separates the numeric data into 2 sets, then split each of those into 2 based on some numeric feature.
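A minimal sketch of that chi-square idea, assuming the numeric feature is split at its median and the label is split into healthy (zone 0) vs. any upcoming reimage. The toy data and the names `toy`, `ram_bin` and `at_risk` are illustrative and not part of the data set.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Toy data standing in for the real data set (values are synthetic)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "zone": rng.integers(0, 7, size=1000),
    "daily_average_ram": rng.normal(50, 14, size=1000),
})

# Split the numeric feature into two sets at its median
toy["ram_bin"] = np.where(
    toy["daily_average_ram"] > toy["daily_average_ram"].median(), "high", "low"
)
# Split the label into two sets: healthy (zone 0) vs. any upcoming reimage
toy["at_risk"] = toy["zone"] > 0

# 2x2 contingency table and chi-square test of independence
table = pd.crosstab(toy["ram_bin"], toy["at_risk"])
chi2, p_value, dof, expected = stats.chi2_contingency(table)
print(table)
print(f"chi2={chi2:.3f}, p={p_value:.3f}, dof={dof}")
```

Because the toy values are random, no association is expected here. On the real data, a small p-value would suggest that the RAM bin and the risk grouping are not independent.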

Daily average RAM: is there a difference between clusters of RAM amount, and is it statistically significant per label? The point is to show that I'm using the variables for good reason, i.e. that they are associated with the response.

Wilcoxon rank-sum test (doesn't assume a particular distribution): look into it just in case, as a hypothesis test for an unknown distribution. In the general case, use the t-test.
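A rough sketch of running both tests side by side with `scipy.stats`, on synthetic RAM usage for two hypothetical groups of devices. The group names, sample sizes and the choice of Welch's variant of the t-test are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Synthetic RAM usage for two groups of devices (illustrative values only)
healthy = rng.normal(50, 14, size=500)
at_risk = rng.normal(55, 14, size=500)

# Welch's t-test: compares means without assuming equal variances
t_stat, t_p = stats.ttest_ind(healthy, at_risk, equal_var=False)
# Wilcoxon rank-sum test: rank-based, no normality assumption
w_stat, w_p = stats.ranksums(healthy, at_risk)
print(f"t-test p={t_p:.4g}, rank-sum p={w_p:.4g}")
```

If the two tests disagree noticeably on a real feature, that is a hint that its distribution is far from normal and the rank-based result is the safer one to report.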

Multiple hypothesis testing applies in my case: after picking an alpha, I apply a correction (read about it). It's a multiplication, most likely the Bonferroni correction, where each p-value is multiplied by the number of tests.
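The note doesn't name the correction. Assuming it is the Bonferroni correction, a minimal sketch looks like this; the function name `bonferroni` and the p-values are hypothetical.

```python
def bonferroni(p_values, alpha=0.05):
    """Return Bonferroni-adjusted p-values and reject/keep decisions."""
    m = len(p_values)
    # Multiply each p-value by the number of tests, capped at 1
    adjusted = [min(p * m, 1.0) for p in p_values]
    # Equivalent to comparing each raw p-value against alpha / m
    reject = [p_adj <= alpha for p_adj in adjusted]
    return adjusted, reject

# Hypothetical p-values from testing four features against the label
adjusted, reject = bonferroni([0.001, 0.02, 0.04, 0.30])
print(adjusted)
print(reject)
```

Bonferroni is conservative. Less strict alternatives exist (for example Holm's method), but they are outside what the note describes.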

Multi-class classification: one-vs-all strategy (look into that), since I have zones 0 to 6. Fit one zone against the rest, once per zone (7 times for 7 zones), and the zone with the highest probability becomes the prediction. Keras should have that. Apply general models first (logistic regression, random forest), then try Keras. Try models on a much smaller subset (100-150k rows, via random or stratified sampling; a stratified sample should keep the same proportions as the larger dataset for all features), then fit the final model to the big dataset to check.
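A small sketch of both ideas with scikit-learn, using synthetic 7-class data in place of the real zones. Note one simplification: `train_test_split` stratifies on the label only, whereas the note asks for matching proportions across all features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

# Synthetic 7-class data standing in for the zone labels 0-6
X, y = make_classification(
    n_samples=7000, n_features=8, n_informative=6, n_redundant=0,
    n_classes=7, n_clusters_per_class=1, random_state=0,
)

# Stratified sample: keeps the class proportions of the full data
X_small, _, y_small, _ = train_test_split(
    X, y, train_size=1400, stratify=y, random_state=0
)

# One binary logistic regression per class; the class with the
# highest score wins
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X_small, y_small)

print(len(ovr.estimators_))
print(ovr.predict(X_small[:5]))
```

The same wrapper accepts other binary estimators, so a model that works on the small sample can be refitted on the full data without changing the surrounding code.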

## Prediction of the device health based on the hardware and software performance

**Problem**: Companies lose a significant amount of employees' time to unexpected hardware crashes and the need to reimage or replace devices. A solution that predicts an upcoming hardware crash early enough to address the issue without disrupting work could significantly increase workforce productivity and save millions of dollars.

**Client**: Companies of any size and individual users.

**Data**: PC performance test data with information on physical/virtual memory, RAM, and software errors. The data set has 5,609,148 rows and 12 features (input variables) stored in a .csv file.

Description of the variables:

* `pcid`: Device ID

* `date`: The day the measurement was taken. Missing values were imputed using forward-fill and back-fill propagation, as well as averaging.

* `free_physical_memory`: How much free physical memory was available on a device on a given date (in MB)

* `free_virtual_memory`: How much free virtual memory was available on a device on a given date (in MB)

* `daily_average_ram`: Percentage of RAM in use per day (sampled each minute and averaged)

* `daily_std_dev_ram`: The standard deviation of the RAM usage for the day (sampled each minute)

* `windows_events_count`: How many Windows events occurred on a given day (Error and Critical only)

* `has_bios_error`: Whether a device reported a BIOS error on a given day

* `driver_crash_count`: How many driver crashes occurred on a device on a given day

* `average_time_since_last_boot`: Time since the last Windows start (in ms)

* `next_reimage`: The date and time when a device is expected to have the next OS reimage. This date is defined retrospectively by capturing the actual reimage date later on.
    
* `zone`: The risk zone of the device, indicating when a device will require a reimage. This variable is the label we'll be trying to predict. This variable is defined by the `next_reimage` variable as per below:
    
    - Zone 0: Device is healthy (the date of next reimage is NaN)
    - Zone 1: Device will have a reimage in the next 0-10 days
    - Zone 2: Device will have a reimage in the next 11-20 days
    - Zone 3: Device will have a reimage in the next 21-30 days
    - Zone 4: Device will have a reimage in the next 31-40 days
    - Zone 5: Device will have a reimage in the next 41-50 days
    - Zone 6: Device will have a reimage in the next 51-60 days
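The zone definition above can be sketched as a small mapping from "days until the next reimage" to a zone number. The function name `zone_from_days` is hypothetical, and the handling of values outside 0-60 days is an assumption, since the list doesn't cover them.

```python
import numpy as np
import pandas as pd

def zone_from_days(days_until_reimage):
    """Map days until the next reimage to a zone (0-6), following the list above."""
    days = pd.Series(days_until_reimage, dtype="float64")
    # ceil(days / 10) gives 1 for 1-10, 2 for 11-20, ..., 6 for 51-60;
    # clip moves day 0 into zone 1
    zone = np.ceil(days / 10).clip(lower=1)
    # Assumption: a missing date (NaN), a past date, or more than
    # 60 days ahead is treated as zone 0
    zone = zone.where(days.between(0, 60), 0)
    return zone.astype(int)

print(zone_from_days([0, 10, 11, 35, 60, np.nan]).tolist())
```

In the real data the input would come from the difference between `next_reimage` and `date`, expressed in days.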

**Method**: The goal of the project is to predict how soon a PC will need a reimage. Several classification methods will be compared to find the best one for predicting the `zone` variable.

**Deliverables**: The outcome of the project will be presented as a Jupyter notebook and a blog post on Medium.


### Step 1 - Exploratory Data Analysis

Loading required libraries and the dataset:

In [80]:
#Pandas for dataframes
import pandas as pd
#Changing default display option to display all columns
pd.set_option('display.max_columns', 21)

#Numpy for numerical computing
import numpy as np

#Matplotlib for visualization
import matplotlib.pyplot as plt

#Display plots in the notebook
%matplotlib inline 

#Seaborn for easier visualization
import seaborn as sns

#Stats package for statistical analysis
from scipy import stats

#Machine learning packages
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split 
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_curve, roc_auc_score, auc, accuracy_score, confusion_matrix, classification_report

In [81]:
#Loading the data set
df = pd.read_csv('/Users/abdarabdar/Documents/sw_health_raw_data.csv', parse_dates=True)

**Parameters of the dataset**

Let's look at the overall characteristics of the dataset, starting with the dataset shape, number and types of variables. 

In [82]:
#Dataframe dimensions
df.shape

(5609148, 12)

In [83]:
#Types of variables
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5609148 entries, 0 to 5609147
Data columns (total 12 columns):
zone                            int64
next_reimage                    object
date                            object
free_physical_memory            float64
free_virtual_memory             float64
daily_average_ram               float64
daily_std_dev_ram               float64
windows_events_count            float64
has_bios_error                  int64
driver_crash_count              float64
average_time_since_last_boot    float64
pcid                            object
dtypes: float64(7), int64(2), object(3)
memory usage: 513.5+ MB


Most of the variables are numeric. Let's look at the first and last 5 rows of the data:

In [84]:
df.head()

| | zone | next_reimage | date | free_physical_memory | free_virtual_memory | daily_average_ram | daily_std_dev_ram | windows_events_count | has_bios_error | driver_crash_count | average_time_since_last_boot | pcid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | NaN | 2017-12-22 | 3312.5 | 3693.0 | 56.735933 | 1.85166 | 1.0 | 0 | 0.0 | 1519691.0 | 61d1ce206fb0d40117ad5a762b86972f51c75de8ebd6e1... |
| 1 | 0 | NaN | 2017-12-22 | 3699.0 | 8911.0 | 49.706861 | 2.353272 | 1.0 | 1 | 0.0 | 18339.04 | 7b5a90c8c7f13a7f1aa96e0b65e013e75e191e34b3296d... |
| 2 | 0 | NaN | 2017-12-22 | 1787.0 | 2423.0 | 77.401547 | 1.498973 | 1.0 | 0 | 0.0 | 2935453.0 | 2b1b3b12fb5f7b300d1ab3de6cb209cf279795ffe9b5e3... |
| 3 | 0 | NaN | 2017-12-22 | 10598.0 | 13119.0 | 35.569576 | 1.609778 | 1.0 | 0 | 0.0 | 7041.347 | 9d497b6b17459fccedfd37aeb2fd1789e01ec6214a748a... |
| 4 | 0 | NaN | 2017-12-22 | 5060.0 | 5968.0 | 34.740309 | 1.634479 | 1.0 | 0 | 0.0 | 521431.8 | 453035355e9e7930610f6866c5e63e225fb7e61c1ecfea... |


In [85]:
df.tail()

| | zone | next_reimage | date | free_physical_memory | free_virtual_memory | daily_average_ram | daily_std_dev_ram | windows_events_count | has_bios_error | driver_crash_count | average_time_since_last_boot | pcid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5609143 | 0 | NaN | 2017-11-27 | 3736.0 | 11865.0 | 75.923269 | 7.046765 | 1.0 | 0 | 0.0 | 2234844.0 | 00368a64efce538e7a43fabcfbe6bdfac2a316dc7074a3... |
| 5609144 | 0 | NaN | 2017-11-27 | 8602.0 | 10341.0 | 45.273042 | 2.765385 | 1.0 | 0 | 0.0 | 608238.2 | 479b80338e81b700b0210ec67d5f0bde9a80f53e5dc493... |
| 5609145 | 0 | NaN | 2017-11-27 | 4752.0 | 6172.0 | 43.395437 | 2.335085 | 0.0 | 0 | 19.0 | 646596.2 | cc9603c452df57f54db0e3987a5ec88e275596e98cdcf1... |
| 5609146 | 0 | NaN | 2017-11-27 | 5085.0 | 6518.0 | 57.548151 | 3.355863 | 1.0 | 0 | 20.0 | 1470173.0 | 6da29dfe3ddb6af696ec5321a129338c9d2547ca95374e... |
| 5609147 | 0 | NaN | 2017-11-27 | 4563.0 | 5660.0 | 43.239631 | 3.294113 | 1.0 | 1 | 5.0 | 349.7002 | e54c81817301edde8669a8eba2bd9a72dbcf6b0d5a2c32... |


The data on both ends look consistent, and there are no obvious errors. There are NaNs in the 'next_reimage' column, which correspond to zone = 0. Since the 'next_reimage' defines the 'zone' variable, we don't need to keep both in the data set and can keep the 'zone' variable only for the purposes of predictive model building.

Let's now check if the variables have any missing values:

In [86]:
#Checking for NaNs
for i in df.columns:
    print(i, ": ", df.loc[:,i].isnull().values.any())

zone :  False
next_reimage :  True
date :  False
free_physical_memory :  False
free_virtual_memory :  False
daily_average_ram :  False
daily_std_dev_ram :  False
windows_events_count :  False
has_bios_error :  False
driver_crash_count :  False
average_time_since_last_boot :  False
pcid :  False


There are no missing values in the data except in the `next_reimage` column, which is expected. The `date` variable was imputed beforehand, as mentioned above, so it has no missing values either.

Before exploring the distributions of the variables, let's do some cleaning. We'll convert the 'zone' variable into category and 'date' into datetime object and drop the 'next_reimage' variable.

In [93]:
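#Cleaning the data set
df_cleaned = df.copy()
#Treat the label as a categorical variable
df_cleaned['zone'] = df_cleaned['zone'].astype('category')
#Parse the date strings into datetime objects
df_cleaned['date'] = pd.to_datetime(df_cleaned.date)
#'zone' is derived from 'next_reimage', so the latter is redundant
df_cleaned = df_cleaned.drop('next_reimage', axis=1)
df_cleaned.head()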

| | zone | date | free_physical_memory | free_virtual_memory | daily_average_ram | daily_std_dev_ram | windows_events_count | has_bios_error | driver_crash_count | average_time_since_last_boot | pcid |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2017-12-22 | 3312.5 | 3693.0 | 56.735933 | 1.85166 | 1.0 | 0 | 0.0 | 1519691.0 | 61d1ce206fb0d40117ad5a762b86972f51c75de8ebd6e1... |
| 1 | 0 | 2017-12-22 | 3699.0 | 8911.0 | 49.706861 | 2.353272 | 1.0 | 1 | 0.0 | 18339.04 | 7b5a90c8c7f13a7f1aa96e0b65e013e75e191e34b3296d... |
| 2 | 0 | 2017-12-22 | 1787.0 | 2423.0 | 77.401547 | 1.498973 | 1.0 | 0 | 0.0 | 2935453.0 | 2b1b3b12fb5f7b300d1ab3de6cb209cf279795ffe9b5e3... |
| 3 | 0 | 2017-12-22 | 10598.0 | 13119.0 | 35.569576 | 1.609778 | 1.0 | 0 | 0.0 | 7041.347 | 9d497b6b17459fccedfd37aeb2fd1789e01ec6214a748a... |
| 4 | 0 | 2017-12-22 | 5060.0 | 5968.0 | 34.740309 | 1.634479 | 1.0 | 0 | 0.0 | 521431.8 | 453035355e9e7930610f6866c5e63e225fb7e61c1ecfea... |


Next, let's look at the distributions of the numerical variables in the data set:

In [96]:
#Obtaining the distributions of the numerical variables
df_cleaned.describe()

| | free_physical_memory | free_virtual_memory | daily_average_ram | daily_std_dev_ram | windows_events_count | has_bios_error | driver_crash_count | average_time_since_last_boot |
|---|---|---|---|---|---|---|---|---|
| count | 5609148.0 | 5609148.0 | 5609148.0 | 5609148.0 | 5609148.0 | 5609148.0 | 5609148.0 | 5609148.0 |
| mean | 5301.039 | 9390.899 | 50.84281 | 2.291617 | 0.948564 | 0.222095 | 3.561113 | -1684183.0 |
| std | 5685.789 | 9375.644 | 14.07656 | 4.89127 | 0.2208853 | 0.4156547 | 6.910196 | 744488100.0 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -251887500000.0 |
| 25% | 3112.0 | 4594.0 | 42.80038 | 1.387752 | 1.0 | 0.0 | 0.0 | 36916.86 |
| 50% | 3969.0 | 6805.5 | 50.34052 | 2.11089 | 1.0 | 0.0 | 0.0 | 224184.5 |
| 75% | 4906.0 | 11594.0 | 59.55315 | 2.931696 | 1.0 | 0.0 | 5.0 | 653792.5 |
| max | 256155.0 | 381930.0 | 1418.681 | 7412.071 | 1.0 | 1.0 | 608.0 | 1199125000.0 |


The summary highlights several issues. First, `free_physical_memory`, `free_virtual_memory` and `daily_average_ram` all have a minimum of zero, which doesn't make sense: whatever the system's load, these values can't be exactly zero on a working device. Second, `average_time_since_last_boot` has negative values, which is impossible for an elapsed time. Third, the maximum of `daily_average_ram` is 1418.68%, although a percentage of RAM in use can't exceed 100%.

Let's look at the number of instances where such cases take place:

In [115]:
#Number of instances for 'free_physical_memory'
df_cleaned[df_cleaned['free_physical_memory'] == 0].count()

zone                            78
date                            78
free_physical_memory            78
free_virtual_memory             78
daily_average_ram               78
daily_std_dev_ram               78
windows_events_count            78
has_bios_error                  78
driver_crash_count              78
average_time_since_last_boot    78
pcid                            78
dtype: int64

In [104]:
#Number of instances for 'free_virtual_memory'
df_cleaned[df_cleaned['free_virtual_memory'] == 0].count()

zone                            78
date                            78
free_physical_memory            78
free_virtual_memory             78
daily_average_ram               78
daily_std_dev_ram               78
windows_events_count            78
has_bios_error                  78
driver_crash_count              78
average_time_since_last_boot    78
pcid                            78
dtype: int64

In [109]:
#Number of instances for 'daily_average_ram'
df_cleaned[df_cleaned['daily_average_ram'] > 100].count()

zone                            20
date                            20
free_physical_memory            20
free_virtual_memory             20
daily_average_ram               20
daily_std_dev_ram               20
windows_events_count            20
has_bios_error                  20
driver_crash_count              20
average_time_since_last_boot    20
pcid                            20
dtype: int64

In [126]:
#Number of instances where 'average_time_since_last_boot' is at most 60,000 ms (one minute)
df_cleaned[df_cleaned['average_time_since_last_boot'] <= 60000].count()

zone                            1589662
date                            1589662
free_physical_memory            1589660
free_virtual_memory             1589662
daily_average_ram               1589662
daily_std_dev_ram               1589662
windows_events_count            1589662
has_bios_error                  1589662
driver_crash_count              1589662
average_time_since_last_boot    1589662
pcid                            1589662
dtype: int64
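The four counts above can also be gathered in one pass, together with each count's share of the data set. This is a minimal sketch on a tiny synthetic frame; the helper `summarize_checks` and the `toy` values are hypothetical.

```python
import pandas as pd

def summarize_checks(frame, checks):
    """Count the rows matching each boolean mask and their share of all rows."""
    rows = {name: int(mask.sum()) for name, mask in checks.items()}
    summary = pd.DataFrame({"rows": pd.Series(rows)})
    summary["share_pct"] = 100 * summary["rows"] / len(frame)
    return summary

# Tiny synthetic frame for illustration (not the real data)
toy = pd.DataFrame({
    "free_physical_memory": [0.0, 3112.0, 3969.0, 4906.0],
    "daily_average_ram": [42.8, 50.3, 1418.7, 59.6],
    "average_time_since_last_boot": [-5.0, 36916.9, 224184.5, 653792.5],
})
checks = {
    "free_physical_memory == 0": toy["free_physical_memory"] == 0,
    "daily_average_ram > 100": toy["daily_average_ram"] > 100,
    "average_time_since_last_boot <= 60000": toy["average_time_since_last_boot"] <= 60000,
}
print(summarize_checks(toy, checks))
```

Seeing the share next to the raw count makes it easier to judge whether a rule flags a handful of rows or a large part of the data.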

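The zero-memory and over-100% RAM cases represent a tiny fraction of the data set (78 rows each for the zero-memory cases and 20 rows for RAM usage above 100%, out of 5,609,148). The boot-time filter, however, matches 1,589,662 rows (about 28% of the data), which is too many to discard without a closer look. Let's start by replacing the impossible RAM values with NaN and re-checking the columns for missing values: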

In [None]:
#Marking the erroneous instances as missing (NaN)
#df_cleaned['free_physical_memory'] = df_cleaned['free_physical_memory'].replace(0, np.nan)
#df_cleaned['free_virtual_memory'] = df_cleaned['free_virtual_memory'].replace(0, np.nan)
#Series.replace() matches values, not a boolean mask, so select the rows with .loc instead
df_cleaned.loc[df_cleaned['daily_average_ram'] > 100, 'daily_average_ram'] = np.nan
#df_cleaned[df_cleaned['free_physical_memory'] == 0].count()

In [None]:
for i in df_cleaned.columns:
    print(i, ": ", df_cleaned.loc[:,i].isnull().values.any())

- Look at the value counts for zones - imbalance?
- convert zone into categorical
- Set date as index, look at the dynamics over time; value counts over months, days, etc - imbalance?
- daily average ram - is there difference per different clusters of ram amount?
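A possible starting point for the first and third items, sketched on a small synthetic frame. The `toy` frame stands in for `df_cleaned`, and its values are made up.

```python
import pandas as pd

# Small synthetic frame standing in for df_cleaned (illustrative values only)
toy = pd.DataFrame({
    "date": pd.to_datetime(
        ["2017-10-01", "2017-10-15", "2017-11-02", "2017-11-27", "2017-12-22", "2017-12-31"]
    ),
    "zone": [0, 0, 0, 1, 0, 6],
})

# Share of rows per zone; a dominant zone 0 would indicate class imbalance
zone_share = toy["zone"].value_counts(normalize=True).sort_index()
print(zone_share)

# Rows per calendar month, using the date as the index
monthly_rows = toy.set_index("date").resample("MS").size()
print(monthly_rows)
```

On the real data, a heavily skewed `zone_share` would be a reason to use stratified sampling and to look beyond plain accuracy when evaluating models.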

In [94]:
max(df_cleaned['date'])

Timestamp('2017-12-31 00:00:00')

In [95]:
min(df_cleaned['date'])

Timestamp('2017-10-01 00:00:00')

In [97]:
max(df_cleaned['daily_average_ram'])

1418.6811505279968