# Date Exploration (6min == 0.01)

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

A good chocolate soufflé is decadent, delicious, and delicate. But, it's a challenge to prepare. When you pull a disappointingly deflated dessert out of the oven, you instinctively retrace your steps to identify at what point you went wrong. Bosch, one of the world's leading manufacturing companies, has an imperative to ensure that the recipes for the production of its advanced mechanical components are of the highest quality and safety standards. Part of doing so is closely monitoring its parts as they progress through the manufacturing processes.

Because Bosch records data at every step along its assembly lines, they have the ability to apply advanced analytics to improve these manufacturing processes. However, the intricacies of the data and complexities of the production line pose problems for current methods.

In this competition, Bosch is challenging Kagglers to predict internal failures using thousands of measurements and tests made for each component along the assembly line. This would enable Bosch to bring quality products at lower costs to the end user.

Submissions are evaluated on the Matthews correlation coefficient (MCC) between the predicted and the observed response. The MCC is given by:

$$\mathrm{MCC} = \frac{(TP \cdot TN) - (FP \cdot FN)}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}},$$


where TP is the number of true positives, TN the number of true negatives, FP the number of false positives, and FN the number of false negatives.
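
To see why MCC is used here rather than accuracy, the following minimal sketch computes both from label lists. The function name `mcc_from_labels` and the toy labels are made up for illustration, and reporting 0 when the denominator is zero is an assumed convention.

```python
import math

def mcc_from_labels(y_true, y_pred):
    """Matthews correlation coefficient for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: an undefined MCC (zero denominator) is reported as 0.
    return (tp * tn - fp * fn) / denominator if denominator else 0.0

# Hypothetical imbalanced sample: 10 failures among 1000 parts.
y_true = [1] * 10 + [0] * 990
always_ok = [0] * 1000                      # never predicts a failure
accuracy = sum(t == p for t, p in zip(y_true, always_ok)) / len(y_true)
print(accuracy, mcc_from_labels(y_true, always_ok), mcc_from_labels(y_true, y_true))
```

A classifier that never predicts a failure looks excellent by accuracy on such data, while its MCC is 0.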

Data Description

The data for this competition represents measurements of parts as they move through Bosch's production lines. Each part has a unique Id. The goal is to predict which parts will fail quality control (represented by a 'Response' = 1).

The dataset contains an extremely large number of anonymized features. Features are named according to a convention that tells you the production line, the station on the line, and a feature number. E.g. L3_S36_F3939 is a feature measured on line 3, station 36, and is feature number 3939.

On account of the large size of the dataset, we have separated the files by the type of feature they contain: numerical, categorical, and finally, a file with date features. The date features provide a timestamp for when each measurement was taken. Each date column ends in a number that corresponds to the previous feature number. E.g. the value of L0_S0_D1 is the time at which L0_S0_F0 was taken.
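
The naming convention can be unpacked mechanically. The sketch below is only an illustration: the helper names `parse_feature_name` and `measured_feature` are hypothetical, and the "date number minus one" rule is inferred from the single example given above (L0_S0_D1 timestamps L0_S0_F0).

```python
import re

FEATURE_NAME = re.compile(r"^L(\d+)_S(\d+)_([FD])(\d+)$")

def parse_feature_name(name):
    """Split a column name such as 'L3_S36_F3939' into its parts."""
    match = FEATURE_NAME.match(name)
    if match is None:
        raise ValueError(f"unexpected column name: {name!r}")
    line, station, kind, number = match.groups()
    return {"line": int(line), "station": int(station),
            "kind": kind, "number": int(number)}

def measured_feature(date_column):
    """Feature column whose timestamp is stored in `date_column` (assumed rule: D(n) -> F(n-1))."""
    parts = parse_feature_name(date_column)
    if parts["kind"] != "D":
        raise ValueError(f"not a date column: {date_column!r}")
    return f"L{parts['line']}_S{parts['station']}_F{parts['number'] - 1}"

print(parse_feature_name("L3_S36_F3939"))
print(measured_feature("L0_S0_D1"))
```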

In addition to being one of the largest datasets (in terms of number of features) ever hosted on Kaggle, the ground truth for this competition is highly imbalanced. Together, these two attributes are expected to make this a challenging problem.

File descriptions

- train_numeric.csv - the training set numeric features (this file contains the 'Response' variable)
- test_numeric.csv - the test set numeric features (you must predict the 'Response' for these Ids)
- train_categorical.csv - the training set categorical features
- test_categorical.csv - the test set categorical features
- train_date.csv - the training set date features
- test_date.csv - the test set date features
- sample_submission.csv - a sample submission file in the correct format

Dataset notes:
1. Each line has several stations, and station IDs are unique across lines, so knowing which station a part is at is enough to work out the line.
2. As a part passes through a station, the corresponding features are measured at a specific time, so the data can be arranged by Id (part) and time (measurement time). For example, L3_S36_F3939 with time 87.2 means that the part was on line 3, station 36, at time 87.2, where feature 3939 was measured.
3. Several features are measured on a part at a single station, and all of that part's features at that station share the same measurement time.

Hi,
The timestamps were anonymized in this competition. My motivation was to understand how long the train/test period is, which would let us bring some intuition to feature engineering.
My main question was: what does a time difference of 0.01 mean? Is it milliseconds, seconds, minutes, hours, or days? To answer that, I looked for periodic patterns using autocorrelation.

I can't help with the "how do I begin" question. Fortunately there are plenty of forum topics with similar questions.
https://www.kaggle.com/forums/f/208/getting-started
Just google "kaggle start" or "kaggle begin"



In [None]:
# Let's check the min and max times for each station
def get_station_times(dates, withId=False):
    times = []
    cols = list(dates.columns)
    if 'Id' in cols:
        cols.remove('Id')
    for feature_name in cols:
        if withId:
            df = dates[['Id', feature_name]].copy()
            df.columns = ['Id', 'time']
        else:
            df = dates[[feature_name]].copy()
            df.columns = ['time']
        df['station'] = feature_name.split('_')[1][1:]
        df = df.dropna()
        times.append(df)
    return pd.concat(times)

In [None]:
train_date_part = pd.read_csv('../input/bosch-production-line-performance/train_date.csv.zip', nrows=10000)
print(train_date_part.shape)
train_date_part

In [None]:
train_date_part_reprocess=get_station_times(train_date_part, withId=True)
train_date_part_reprocess

In [None]:
train_numeric = pd.read_csv('../input/bosch-production-line-performance/train_numeric.csv.zip', nrows=10000)
print(train_numeric.shape)
train_numeric

In [None]:
len(set(train_numeric.Id))

In [None]:
train_categorical = pd.read_csv('../input/bosch-production-line-performance/train_categorical.csv.zip', nrows=10000)
print(train_categorical.shape)
train_categorical

In [None]:
# Let's check the min and max times for each station
# Split out each column, drop its NaNs, then concatenate everything back together
    
dates=train_date_part.copy()
withId=True
times = []
cols = list(dates.columns)
if 'Id' in cols:
    cols.remove('Id')
for feature_name in cols:
    if withId:
        df = dates[['Id', feature_name]].copy()
        df.columns = ['Id', 'time']
    else:
        df = dates[[feature_name]].copy()
        df.columns = ['time']
    df['line'] = feature_name.split('_')[0][1:]
    df['station'] = feature_name.split('_')[1][1:]
    df['feature_number'] = feature_name.split('_')[2][1:]
    df = df.dropna()
    print(df.shape)
    times.append(df)
    print(len(times))
station_times=pd.concat(times)
station_times

In [None]:
station_times['line']=station_times['line'].astype('int64')
station_times['station']=station_times['station'].astype('int64')
station_times['feature_number']=station_times['feature_number'].astype('int64')
# sort after the casts so stations order numerically (2 before 10) rather than as strings
station_times=station_times.sort_values(by=['Id','station'])
print(station_times.dtypes)
station_times

In [None]:
# Which stations belong to each line?
set(station_times[station_times.line==0].station)

In [None]:
# Which stations recorded a measurement at this time?
print(set(station_times[station_times.time==82.24].station))
# Which parts were measured at this time?
print(set(station_times[station_times.time==82.24].Id))

In [None]:
time=82.24
print('time: ',time)
station_=set(station_times[station_times.time==time].station)
for j in station_:
    print('station:', j)
    print('part: ',set(station_times[(station_times.time==time) & (station_times.station==j)].Id))
    print('feature_number: ',set(station_times[(station_times.time==time) & (station_times.station==j)].feature_number))
        

In [None]:
time=1379.78
print('time: ',time)
station_=set(station_times[station_times.time==time].station)  # use the time set above, not the earlier 82.24
for j in station_:
    print('station:', j)
    print('part: ',set(station_times[(station_times.time==time) & (station_times.station==j)].Id))
    print('feature_number: ',set(station_times[(station_times.time==time) & (station_times.station==j)].feature_number))
        

In [None]:
set(station_times[station_times.line==1].station)

In [None]:
set(station_times[station_times.line==2].station)

In [None]:
set(station_times[station_times.line==3].station)

Station IDs are unique across lines, so there is no need to include the line here.
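
That claim can be checked directly from the column names. The sketch below assumes the `L<line>_S<station>_<D|F><number>` naming described earlier; the helper `lines_per_station` and the sample column names are hypothetical.

```python
from collections import defaultdict

def lines_per_station(column_names):
    """Map each station number to the set of line numbers it appears under."""
    mapping = defaultdict(set)
    for name in column_names:
        line, station, _ = name.split("_")
        mapping[int(station[1:])].add(int(line[1:]))
    return dict(mapping)

# Hypothetical sample of date column names.
columns = ["L0_S0_D1", "L0_S0_D3", "L1_S24_D677", "L3_S36_D3919"]
mapping = lines_per_station(columns)
# Stations listed under more than one line would contradict the claim.
ambiguous = {station: lines for station, lines in mapping.items() if len(lines) > 1}
print(ambiguous)
```

In the notebook, the same check could be run on `train_date_part.columns.drop('Id')`; an empty `ambiguous` dict would support dropping the line.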

In [None]:
station_times.line.value_counts()

In [None]:
set(station_times.Id)

In [None]:
station_times.station.value_counts()

In [None]:
station_times.feature_number.value_counts()

In [None]:
part_id=6
part_filter=station_times[station_times.Id==part_id]
part_filter_line=set(part_filter.line)
part_filter_station=set(part_filter.station)
print('total_line: ',part_filter_line)
print('total_station: ',part_filter_station)

for i in part_filter_line:
    print('line:', i)
    for j in part_filter_station:
        print('station:', j)
        print('feature_number: ',set(part_filter[(part_filter.line==i) & (part_filter.station==j)].feature_number))
        print('time: ',set(part_filter[(part_filter.line==i) & (part_filter.station==j)].time))
        

In [None]:
def part_info(part_id):
    #part_id=part_id
    part_filter=station_times[station_times.Id==part_id]
    part_filter_line=set(part_filter.line)
    part_filter_station=set(part_filter.station)
    print('total_line: ',part_filter_line)
    print('total_station: ',part_filter_station)
    print('-'*60)

    for i in part_filter_line:
        print('-'*10)
        print('line:', i)
        for j in part_filter_station:
            print('station:', j)
            print('feature_number: ',set(part_filter[(part_filter.line==i) & (part_filter.station==j)].feature_number))
            print('time: ',set(part_filter[(part_filter.line==i) & (part_filter.station==j)].time))
        

In [None]:
part_info(120)

In [None]:
min_station_times = station_times.groupby(['Id', 'station'])['time'].min()
max_station_times = station_times.groupby(['Id', 'station'])['time'].max()

In [None]:
min_station_times

In [None]:
max_station_times

In [None]:
train_date_part

In [None]:
train_date_part.count()

In [None]:
# Read station times for train and test
date_cols = train_date_part.drop('Id', axis=1).count().reset_index().\
            sort_values(by=0, ascending=False)
date_cols

In [None]:
date_cols['station'] = date_cols['index'].apply(lambda s: s.split('_')[1])
date_cols

In [None]:
date_cols = date_cols.drop_duplicates('station', keep='first')['index'].tolist()
date_cols # selected features
# keep one date column per station and drop the rest (each station measures several features)

In [None]:
# applied these columns to all training data set
train_date = pd.read_csv('../input/bosch-production-line-performance/train_date.csv.zip', usecols=date_cols)
print(train_date.shape)
train_date

In [None]:
dates=train_date.copy()
withId=False
times = []
cols = list(dates.columns)
if 'Id' in cols:
    cols.remove('Id')
for feature_name in cols:
    if withId:
        df = dates[['Id', feature_name]].copy()
        df.columns = ['Id', 'time']
    else:
        df = dates[[feature_name]].copy()
        df.columns = ['time']
    df['line'] = feature_name.split('_')[0][1:]
    df['station'] = feature_name.split('_')[1][1:]
    df['feature_number'] = feature_name.split('_')[2][1:]
    df = df.dropna()
    #print(df.shape)
    times.append(df)
    #print(len(times))
train_station_times=pd.concat(times)
print(train_station_times.shape)
train_station_times
# Because only 52 columns are kept, the total is about 14 million rows, which is manageable;
# keeping 1000 columns would make that number very large

In [None]:
train_time_cnt = train_station_times.groupby('time').count()[['station']].reset_index()
train_time_cnt.columns = ['time', 'cnt']
print(train_time_cnt.shape)
train_time_cnt
# For each timestamp, count how many station records were logged at that time.

In [None]:
test_date = pd.read_csv('../input/bosch-production-line-performance/test_date.csv.zip', usecols=date_cols)
print(test_date.shape)
test_date

In [None]:
dates=test_date.copy()
withId=False
times = []
cols = list(dates.columns)
if 'Id' in cols:
    cols.remove('Id')
for feature_name in cols:
    if withId:
        df = dates[['Id', feature_name]].copy()
        df.columns = ['Id', 'time']
    else:
        df = dates[[feature_name]].copy()
        df.columns = ['time']
    df['line'] = feature_name.split('_')[0][1:]
    df['station'] = feature_name.split('_')[1][1:]
    df['feature_number'] = feature_name.split('_')[2][1:]
    df = df.dropna()
    #print(df.shape)
    times.append(df)
    #print(len(times))
test_station_times=pd.concat(times)
print(test_station_times.shape)
test_station_times

In [None]:
test_time_cnt = test_station_times.groupby('time').count()[['station']].reset_index()
test_time_cnt.columns = ['time', 'cnt']
print(test_time_cnt.shape)

In [None]:
import matplotlib.pyplot as plt
fig = plt.figure()
plt.plot(train_time_cnt['time'].values, train_time_cnt['cnt'].values, 'b.', alpha=0.1, label='train')
plt.plot(test_time_cnt['time'].values, test_time_cnt['cnt'].values, 'r.', alpha=0.1, label='test')
plt.title('Original date values')
plt.ylabel('Number of records')
plt.xlabel('Time')
fig.savefig('original_date_values.png', dpi=300)
plt.show()

In [None]:
print((train_time_cnt['time'].min(), train_time_cnt['time'].max()))
print((test_time_cnt['time'].min(), test_time_cnt['time'].max()))

A few observations:

- The train and test sets cover the same time period
- There is a clear periodic pattern
- The dates are mapped to the range 0 to 1718 with a granularity of 0.01
- There is a gap in the middle

Can we figure out what 0.01 means? Let's check a few autocorrelations!
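
The idea is that a count series with a repeating pattern correlates strongly with a copy of itself shifted by one full period. Here is a minimal sketch on synthetic data, not the Bosch data; the helper names and the toy series are made up for illustration.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def autocorrelation(series, lag):
    """Correlation between the series and itself shifted by `lag` ticks."""
    return pearson(series[:-lag], series[lag:])

# Synthetic counts with a known period of 7 ticks: busy for 5 ticks, idle for 2.
period = 7
series = [10 if (t % period) < 5 else 0 for t in range(700)]
lags = range(1, 2 * period)   # search below two periods so only one true peak exists
best_lag = max(lags, key=lambda k: autocorrelation(series, k))
print(best_lag)
```

On real data the peaks are less clean, so it helps to look at the whole autocorrelation curve rather than a single maximum.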

In [None]:
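# round to 2 decimals so the generated float ticks match the parsed timestamps in the merge below
time_ticks = np.round(np.arange(train_time_cnt['time'].min(), train_time_cnt['time'].max() + 0.01, 0.01), 2)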
time_ticks = pd.DataFrame({'time': time_ticks})
time_ticks

In [None]:
time_ticks = pd.merge(time_ticks, train_time_cnt, how='left', on='time')
time_ticks = time_ticks.fillna(0)
time_ticks
# Number of station records at each specific time across the whole data set (0 where nothing was recorded)

In [None]:
# Autocorrelation
x = time_ticks['cnt'].values
max_lag = 8000
auto_corr_ks = range(1, max_lag)
auto_corr = np.array([1] + [np.corrcoef(x[:-k], x[k:])[0, 1] for k in auto_corr_ks])
auto_corr

In [None]:
print(len(auto_corr_ks))
print(auto_corr_ks)
print(len(x))
print(x)

In [None]:
k=1
print('k',k)
print(len(x[:-k]))
print(len(x[k:]))
print(x[:-k])
print(x[k:])
print('corrcoef: \n',np.corrcoef(x[:-k], x[k:]))
print('corrcoef: \n',np.corrcoef(x[:-k], x[k:])[0,1])
print('corrcoef: \n',np.array([1] + [np.corrcoef(x[:-k], x[k:])[0,1]]))  # list + list prepends the lag-0 value; list + float would add 1 to it

In [None]:
k=3
print('k',k)
print(len(x[:-k]))
print(len(x[k:]))
print(x[:-k])
print(x[k:])
print('corrcoef: \n',np.corrcoef(x[:-k], x[k:]))
print('corrcoef: \n',np.corrcoef(x[:-k], x[k:])[0,1])
print('corrcoef: \n',np.array([1] + [np.corrcoef(x[:-k], x[k:])[0,1]]))  # list + list prepends the lag-0 value; list + float would add 1 to it

Autocorrelation period 0.01

In [None]:
print(len(auto_corr))
auto_corr

In [None]:
fig = plt.figure()
plt.plot(auto_corr, 'k.', label='autocorrelation by 0.01')
plt.title('Train Sensor Time Auto-correlation')

Autocorrelation period 25

In [None]:
period = 25
auto_corr_ks = list(range(period, max_lag, period))
print(len(auto_corr_ks))
print(auto_corr_ks)

In [None]:
auto_corr = np.array([1] + [np.corrcoef(x[:-k], x[k:])[0, 1] for k in auto_corr_ks])
auto_corr

In [None]:
plt.plot([0] + auto_corr_ks, auto_corr, 'go', alpha=0.5, label='strange autocorrelation at 0.25')

Autocorrelation period 1675

In [None]:
period = 1675
auto_corr_ks = list(range(period, max_lag, period))
print(len(auto_corr_ks))
print(auto_corr_ks)

In [None]:
auto_corr = np.array([1] + [np.corrcoef(x[:-k], x[k:])[0, 1] for k in auto_corr_ks])
auto_corr

In [None]:
plt.plot([0] + auto_corr_ks, auto_corr, 'ro', markersize=10, alpha=0.5, label='one week = 16.75?')

In [None]:
fig = plt.figure()
plt.plot(auto_corr, 'k.', label='autocorrelation by 0.01')
plt.title('Train Sensor Time Auto-correlation')

period = 25
auto_corr_ks = list(range(period, max_lag, period))
auto_corr = np.array([1] + [np.corrcoef(x[:-k], x[k:])[0, 1] for k in auto_corr_ks])
plt.plot([0] + auto_corr_ks, auto_corr, 'go', alpha=0.5, label='strange autocorrelation at 0.25')

period = 1675
auto_corr_ks = list(range(period, max_lag, period))
auto_corr = np.array([1] + [np.corrcoef(x[:-k], x[k:])[0, 1] for k in auto_corr_ks])
plt.plot([0] + auto_corr_ks, auto_corr, 'ro', markersize=10, alpha=0.5, label='one week = 16.75?')

plt.xlabel('k * 0.01 -  autocorrelation lag')
plt.ylabel('autocorrelation')
plt.legend(loc=0)
#fig.savefig('train_time_auto_correlation.png', dpi=300)

The largest peaks are at approximately 1680 ticks. Let's call it a week ;)

In each week we could see 7 local maxima ~ days.
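
If the weekly interpretation holds, the tick size follows from simple arithmetic. The sketch below uses the approximate values read off above (1680 ticks per week, a time range of about 1718); both are rounded observations, not exact constants.

```python
MINUTES_PER_WEEK = 7 * 24 * 60            # 10080 minutes

ticks_per_week = 1680                     # approximate position of the largest peak
minutes_per_tick = MINUTES_PER_WEEK / ticks_per_week
ticks_per_day = ticks_per_week / 7

time_range = 1718                         # approximate span of the anonymized time axis
span_weeks = time_range * 100 / ticks_per_week

print(minutes_per_tick, ticks_per_day, round(span_weeks, 1))
```

So one tick of 0.01 would correspond to about 6 minutes, and the whole data set would span roughly two years.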

In [None]:
train_time_cnt

In [None]:
week_duration = 1679
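# round before casting: time * 100 can land just below an integer and would be truncated one tick low
train_time_cnt['week_part'] = np.round(train_time_cnt['time'].values * 100).astype(np.int64) % week_duration
train_time_cnt
# week_part repeats every time a full cycle of 1679 ticks has passed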

In [None]:
print(len(set(train_time_cnt.time)))
print(len(set(train_time_cnt.week_part)))

In [None]:
x = 32
y = 15
print(x % y)

In [None]:
x = 47
y = 15
print(x % y)

In [None]:
print(0.01*100 % week_duration)
print(10*100 % week_duration)
print(1718*100 % week_duration)
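
One caveat when turning timestamps into integer ticks: `time * 100` is a floating-point product, and truncating it can land one tick low. A minimal sketch, where the helper name `week_part` and the default `week_duration` mirror the notebook's variables but are otherwise illustrative:

```python
def week_part(time_value, week_duration=1679):
    """Position of a timestamp within the assumed weekly cycle, in 0.01 ticks."""
    ticks = int(round(time_value * 100))   # round first, then take the remainder
    return ticks % week_duration

truncated = int(0.29 * 100)          # 0.29 * 100 is 28.999999999999996 in floating point
rounded = int(round(0.29 * 100))
print(truncated, rounded, week_part(16.80))
```

Rounding before the integer conversion keeps neighbouring timestamps from collapsing into the same tick.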

In [None]:
fig = plt.figure()
plt.plot(train_time_cnt.time.values, train_time_cnt.cnt.values, 'b.',
         alpha=0.5, label='train count')

In [None]:
# Aggregate weekly stats
train_week_part = train_time_cnt.groupby(['week_part'])[['cnt']].sum().reset_index()
train_week_part

In [None]:
fig = plt.figure()
plt.plot(train_week_part.week_part.values, train_week_part.cnt.values, 'b.',
         alpha=0.5, label='train count')
# Sum all station counts by position within the week, then plot

In [None]:
y_train = train_week_part['cnt'].rolling(window=20, center=True).mean().values
y_train

In [None]:
plt.plot(train_week_part.week_part.values, y_train, 'b-', linewidth=4, alpha=0.5, label='train count smooth')

In [None]:
week_duration = 1679
train_time_cnt['week_part'] = np.round(train_time_cnt['time'].values * 100).astype(np.int64) % week_duration  # round before casting to avoid truncation
# Aggregate weekly stats
train_week_part = train_time_cnt.groupby(['week_part'])[['cnt']].sum().reset_index()
fig = plt.figure()
plt.plot(train_week_part.week_part.values, train_week_part.cnt.values, 'b.', alpha=0.5, label='train count')
y_train = train_week_part['cnt'].rolling(window=20, center=True).mean().values
plt.plot(train_week_part.week_part.values, y_train, 'b-', linewidth=4, alpha=0.5, label='train count smooth')
plt.title('Relative Part of week')
plt.ylabel('Number of records')
plt.xlim(0, 1680)
fig.savefig('week_duration.png', dpi=300)

# 69% failure rate

Station combinations
We have seen that station 32 has a high (4.7%) error rate.

Let's investigate that failure rate with station combinations.
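
The computation behind this is a grouped failure rate: for each combination of visited stations, divide the number of failed parts by the number of parts. The sketch below shows that on toy data in plain Python; the function name and the rows are hypothetical.

```python
from collections import defaultdict

def failure_rate_by_combination(rows):
    """rows: iterable of (visit_flags, response), e.g. ((1, 0, 0), 1)."""
    totals = defaultdict(lambda: [0, 0])          # combination -> [parts, failures]
    for flags, response in rows:
        totals[flags][0] += 1
        totals[flags][1] += response
    return {combo: failures / parts for combo, (parts, failures) in totals.items()}

# Hypothetical toy data: each flag tuple marks which of three stations a part visited.
rows = [
    ((1, 0, 0), 1), ((1, 0, 0), 0),
    ((0, 1, 1), 0), ((0, 1, 1), 0), ((0, 1, 1), 0), ((0, 1, 1), 1),
]
rates = failure_rate_by_combination(rows)
print(rates)
```

Once the `train` frame is built later in this section, the pandas equivalent would be along the lines of `train.groupby(STATIONS)['Response'].agg(['mean', 'count'])`.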

In [None]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import gc
sns.set_style('whitegrid')

Read station and response data
['S32', 'S33', 'S34'] have the most interesting pattern.

We read the full train set, but only 5 columns.

In [None]:
train_date_part = pd.read_csv('../input/bosch-production-line-performance/train_date.csv.zip', nrows=10000)
train_date_part.shape

In [None]:
date_cols = train_date_part.drop('Id', axis=1).count().reset_index().sort_values(by=0, ascending=False)
date_cols

In [None]:
date_cols['station'] = date_cols['index'].apply(lambda s: s.split('_')[1])
date_cols

In [None]:
STATIONS = ['S32', 'S33', 'S34']
date_cols = date_cols[date_cols['station'].isin(STATIONS)]
date_cols

In [None]:
date_cols = date_cols.drop_duplicates('station', keep='first')['index'].tolist()
date_cols

In [None]:
train_date = pd.read_csv('../input/bosch-production-line-performance/train_date.csv.zip', usecols=['Id'] + date_cols)
print(train_date.columns)
print(train_date.shape)
train_date

In [None]:
STATIONS = ['S32', 'S33', 'S34']
train_date.columns = ['Id'] + STATIONS
train_date

In [None]:
for station in STATIONS:
    train_date[station] = 1 * (train_date[station] >= 0)
train_date

In [None]:
response = pd.read_csv('../input/bosch-production-line-performance/train_numeric.csv.zip', usecols=['Id', 'Response'])
print(response.shape)
response

In [None]:
train = response.merge(train_date, how='left', on='Id')
print(train.shape)
train

In [None]:
train.Response.value_counts()

# DNN WAY

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
nrows=100000 # total full row: 1,183,747
train_date= pd.read_csv('../input/bosch-production-line-performance/train_date.csv.zip', nrows=nrows)
print(train_date.shape)
train_numeric = pd.read_csv('../input/bosch-production-line-performance/train_numeric.csv.zip', nrows=nrows)
print(train_numeric.shape)
#train_categorical = pd.read_csv('../input/bosch-production-line-performance/train_categorical.csv.zip', nrows=nrows)
#print(train_categorical.shape)

In [None]:
train_date.head()

In [None]:
train_numeric.head()

In [None]:
train = train_date.merge(train_numeric, how='left', on='Id')
train

In [None]:
print(train.Response.value_counts(normalize=True))
print(train.Response.value_counts())

In [None]:
train

## Fill NaN with the mean value, or rely on XGBoost, which can handle NaN values

# XGBoost from Kaggle

https://www.kaggle.com/joconnor/python-xgboost-starter-0-209-public-mcc

In [None]:
import pandas as pd
import numpy as np
from xgboost import XGBClassifier
from sklearn.metrics import matthews_corrcoef, roc_auc_score
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

In [None]:
# I'm limited by RAM here and taking the first N rows is likely to be
# a bad idea for the date data since it is ordered.
# Sample the data in a roundabout way:
date_chunks = pd.read_csv("../input/bosch-production-line-performance/train_date.csv.zip", index_col=0, chunksize=100000, dtype=np.float32)
num_chunks = pd.read_csv("../input/bosch-production-line-performance/train_numeric.csv.zip", index_col=0,
                         usecols=list(range(969)), chunksize=100000, dtype=np.float32)

In [None]:
X = pd.concat([pd.concat([dchunk, nchunk], axis=1).sample(frac=0.05)
               for dchunk, nchunk in zip(date_chunks, num_chunks)])
X

In [None]:

y = pd.read_csv("../input/bosch-production-line-performance/train_numeric.csv.zip", index_col=0, usecols=[0,969], dtype=np.float32)\
.loc[X.index].values.ravel()
y


In [None]:
len(y)

In [None]:
clf = XGBClassifier(base_score=0.005)
clf.fit(X, y)

In [None]:
import matplotlib.pyplot as plt
# threshold for a manageable number of features
plt.hist(clf.feature_importances_[clf.feature_importances_>0])
important_indices = np.where(clf.feature_importances_>0.005)[0]
print(important_indices)

In [None]:
# load entire dataset for these features. 
# note where the feature indices are split so we can load the correct ones straight from read_csv
n_date_features = 1156
X = np.concatenate([
    pd.read_csv("../input/bosch-production-line-performance/train_date.csv.zip", index_col=0, dtype=np.float32,
                usecols=np.concatenate([[0], important_indices[important_indices < n_date_features] + 1])).values,
    pd.read_csv("../input/bosch-production-line-performance/train_numeric.csv.zip", index_col=0, dtype=np.float32,
                usecols=np.concatenate([[0], important_indices[important_indices >= n_date_features] + 1 - n_date_features])).values
], axis=1)
y = pd.read_csv("../input/bosch-production-line-performance/train_numeric.csv.zip", index_col=0, dtype=np.float32, usecols=[0,969]).values.ravel()

In [None]:
X.shape

In [None]:
y.shape

In [None]:
pd.DataFrame(y)[0].value_counts(normalize=True)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [None]:
clf = XGBClassifier(max_depth=5, base_score=0.005)
clf.fit(X_train,y_train)
y_pred=clf.predict(X_test)
print(classification_report(y_test,y_pred))
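# score AUC on predicted probabilities; hard 0/1 labels collapse the ROC curve to a single point
print(roc_auc_score(y_test,clf.predict_proba(X_test)[:, 1]))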

In [None]:
from collections import Counter
# count examples in each class
counter = Counter(y)
# estimate scale_pos_weight value
estimate = counter[0] / counter[1]
print('Estimate: %.3f' % estimate)

In [None]:
clf = XGBClassifier(scale_pos_weight=171)
clf.fit(X_train,y_train)
y_pred=clf.predict(X_test)
print(classification_report(y_test,y_pred))
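# score AUC on predicted probabilities; hard 0/1 labels collapse the ROC curve to a single point
print(roc_auc_score(y_test,clf.predict_proba(X_test)[:, 1]))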

In [None]:
clf = XGBClassifier(scale_pos_weight=250)
clf.fit(X_train,y_train)
y_pred=clf.predict(X_test)
print(classification_report(y_test,y_pred))
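# score AUC on predicted probabilities; hard 0/1 labels collapse the ROC curve to a single point
print(roc_auc_score(y_test,clf.predict_proba(X_test)[:, 1]))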

# Random forest

https://www.kaggle.com/aakashveera/random-forest

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import tqdm
import gc
import sys
import warnings
warnings.filterwarnings("ignore")

In [None]:
date = pd.read_csv('../input/bosch-production-line-performance/train_date.csv.zip', nrows=10000)
numeric = pd.read_csv('../input/bosch-production-line-performance/train_numeric.csv.zip', nrows=10000)
category = pd.read_csv('../input/bosch-production-line-performance/train_categorical.csv.zip', nrows=10000)
# The 10000 rows are read only to get the station names from the columns

In [None]:
date

In [None]:
numeric

In [None]:
category

FEATURE ENGINEERING

The list of numeric features was selected using a separate XGBoost classifier; see the numericclassifier notebook.

In [None]:
num_feats = ['Id',
       'L3_S30_F3514', 'L0_S9_F200', 'L3_S29_F3430', 'L0_S11_F314',
       'L0_S0_F18', 'L3_S35_F3896', 'L0_S12_F350', 'L3_S36_F3918',
       'L0_S0_F20', 'L3_S30_F3684', 'L1_S24_F1632', 'L0_S2_F48',
       'L3_S29_F3345', 'L0_S18_F449', 'L0_S21_F497', 'L3_S29_F3433',
       'L3_S30_F3764', 'L0_S1_F24', 'L3_S30_F3554', 'L0_S11_F322',
       'L3_S30_F3564', 'L3_S29_F3327', 'L0_S2_F36', 'L0_S9_F180',
       'L3_S33_F3855', 'L0_S0_F4', 'L0_S21_F477', 'L0_S5_F114',
       'L0_S6_F122', 'L1_S24_F1122', 'L0_S9_F165', 'L0_S18_F439',
       'L1_S24_F1490', 'L0_S6_F132', 'L3_S29_F3379', 'L3_S29_F3336',
       'L0_S3_F80', 'L3_S30_F3749', 'L1_S24_F1763', 'L0_S10_F219',
 'Response']

In [None]:
length = date.drop('Id', axis=1).count()
date_cols = length.reset_index().sort_values(by=0, ascending=False)
date_cols

In [None]:
stations = sorted(date_cols['index'].str.split('_',expand=True)[1].unique().tolist())
stations

In [None]:
len(stations)

In [None]:
date_cols['station'] = date_cols['index'].str.split('_',expand=True)[1]
date_cols

In [None]:
date_cols = date_cols.drop_duplicates('station', keep='first')['index'].tolist()
date_cols

Keep only one date column per station (the one with the most non-missing values); its name still carries the corresponding line and feature number.
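
The selection step can be sketched without pandas. The helper name `pick_column_per_station` and the counts below are hypothetical. The sketch also returns the columns in numeric station order, because a plain string sort would put S10 before S2.

```python
def pick_column_per_station(non_null_counts):
    """non_null_counts: dict mapping column name -> number of non-missing values."""
    best = {}
    for column, count in non_null_counts.items():
        station = column.split("_")[1]
        # Keep the first column seen unless a later one has strictly more values.
        if station not in best or count > best[station][1]:
            best[station] = (column, count)
    ordered = sorted(best, key=lambda s: int(s[1:]))   # numeric, not lexicographic
    return [best[station][0] for station in ordered]

# Hypothetical non-missing counts for a few date columns.
counts = {"L0_S10_D216": 40, "L0_S2_D34": 90, "L0_S2_D38": 95, "L0_S10_D221": 40}
selected = pick_column_per_station(counts)
print(selected)
print(sorted(["S2", "S10"]))   # string sort puts S10 first
```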

In [None]:
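# read_csv returns columns in file order, which differs from the lexicographic order of
# `stations` (S0, S1, S10, ...), so take each station label from the column name itself.
# Visit stations in numeric order so start/end are the lowest/highest station reached.
station_order = sorted(stations, key=lambda s: int(s[1:]))
chunks = []
for chunk in pd.read_csv('../input/bosch-production-line-performance/train_date.csv.zip',usecols=['Id'] + date_cols,chunksize=50000,low_memory=False):

    chunk = chunk.rename(columns=lambda c: c if c == 'Id' else c.split('_')[1])
    chunk['start_station'] = -1
    chunk['end_station'] = -1
    
    for s in station_order:
        visited = chunk[s] >= 0  # NaN compares False: the part has no timestamp at this station
        chunk[s] = 1 * visited
        chunk.loc[(chunk['start_station']== -1) & visited,'start_station'] = int(s[1:])
        chunk.loc[visited,'end_station'] = int(s[1:])
    chunks.append(chunk)
data = pd.concat(chunks)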

In [None]:
data

In [None]:
data = data[['Id','start_station','end_station']]
usefuldatefeatures = ['Id']+date_cols

In [None]:
len(date_cols)

In [None]:
data

In [None]:
usefuldatefeatures

In [None]:
len(chunk.columns.values.tolist())

In [None]:
minmaxfeatures = None
for chunk in pd.read_csv('../input/bosch-production-line-performance/train_date.csv.zip',usecols=usefuldatefeatures,chunksize=50000,low_memory=False):
    features = chunk.columns.values.tolist()
    features.remove('Id')
    df_mindate_chunk = chunk[['Id']].copy()
    df_mindate_chunk['mindate'] = chunk[features].min(axis=1).values
    df_mindate_chunk['maxdate'] = chunk[features].max(axis=1).values
    # pd.notna is safer than an identity check against np.nan for rows with no timestamps
    df_mindate_chunk['min_time_station'] =  chunk[features].idxmin(axis = 1).apply(lambda s: int(s.split('_')[1][1:]) if pd.notna(s) else -1)
    df_mindate_chunk['max_time_station'] =  chunk[features].idxmax(axis = 1).apply(lambda s: int(s.split('_')[1][1:]) if pd.notna(s) else -1)
    minmaxfeatures = pd.concat([minmaxfeatures, df_mindate_chunk])

del chunk
gc.collect()

In [None]:
df_mindate_chunk

In [None]:
minmaxfeatures.sort_values(by=['mindate', 'Id'], inplace=True)
minmaxfeatures['min_Id_rev'] = -minmaxfeatures.Id.diff().shift(-1)
minmaxfeatures['min_Id'] = minmaxfeatures.Id.diff()

In [None]:
minmaxfeatures

In [None]:
cols = [['Id']+date_cols,num_feats]
traindata = None
trainfiles = ['train_date.csv.zip','train_numeric.csv.zip']

In [None]:
cols

In [None]:
for i,f in enumerate(trainfiles):
    
    subset = None
    
    for chunk in pd.read_csv('../input/bosch-production-line-performance/' + f,usecols=cols[i],chunksize=100000,low_memory=False):
        subset = pd.concat([subset, chunk])
    
    if traindata is None:
        traindata = subset.copy()
    else:
        traindata = pd.merge(traindata, subset.copy(), on="Id")
        
del subset,chunk
gc.collect()
del cols[1][-1]

In [None]:
traindata

In [None]:
traindata = traindata.merge(minmaxfeatures, on='Id')
traindata = traindata.merge(data, on='Id')
del minmaxfeatures,data
gc.collect()

In [None]:
traindata

In [None]:
traindata.fillna(value=0,inplace=True)
traindata

In [None]:
def mcc(tp, tn, fp, fn):
    num = tp * tn - fp * fn
    den = (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    if den == 0:
        return 0
    else:
        return num / np.sqrt(den)

In [None]:
def eval_mcc(y_true, y_prob):
    idx = np.argsort(y_prob)
    y_true_sort = y_true[idx]
    n = y_true.shape[0]
    nump = 1.0 * np.sum(y_true) 
    numn = n - nump 
    tp,fp = nump,numn
    tn,fn = 0.0,0.0
    best_mcc = 0.0
    best_id = -1
    mccs = np.zeros(n)
    for i in range(n):
        if y_true_sort[i] == 1:
            tp -= 1.0
            fn += 1.0
        else:
            fp -= 1.0
            tn += 1.0
        new_mcc = mcc(tp, tn, fp, fn)
        mccs[i] = new_mcc
        if new_mcc >= best_mcc:
            best_mcc = new_mcc
            best_id = i
    return best_mcc
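
The sweep above reports the best MCC but not the probability cutoff that achieves it, which is what is needed to turn probabilities into a submission. The sketch below is a plain-Python variant that returns both. The name `best_mcc_threshold` and the toy data are made up, and skipping tied scores is an added design choice rather than part of the function above.

```python
import math

def best_mcc_threshold(y_true, y_prob):
    """Return (best_mcc, threshold), predicting 1 when probability >= threshold."""
    order = sorted(range(len(y_true)), key=lambda i: y_prob[i])
    positives = sum(y_true)
    tp, fp = positives, len(y_true) - positives    # start with everything predicted 1
    tn, fn = 0, 0
    best_mcc, best_threshold = 0.0, None
    for rank, i in enumerate(order[:-1]):          # the last step would predict all 0
        if y_true[i] == 1:                         # sample i moves to the negative side
            tp, fn = tp - 1, fn + 1
        else:
            fp, tn = fp - 1, tn + 1
        next_prob = y_prob[order[rank + 1]]
        if next_prob == y_prob[i]:
            continue                               # a threshold cannot split tied scores
        den = (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
        score = (tp * tn - fp * fn) / math.sqrt(den) if den else 0.0
        if score > best_mcc:
            best_mcc, best_threshold = score, next_prob
    return best_mcc, best_threshold

# Hypothetical toy labels and predicted probabilities.
y_true = [0, 0, 1, 0, 1, 1]
y_prob = [0.1, 0.2, 0.35, 0.4, 0.8, 0.9]
score, threshold = best_mcc_threshold(y_true, y_prob)
print(round(score, 4), threshold)
```

The threshold should be chosen on held-out data, since picking it on the training predictions tends to be optimistic.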

In [None]:
def mcc_eval(y_prob, dtrain):
    y_true = dtrain.get_label()
    best_mcc = eval_mcc(y_true, y_prob)
    return 'MCC', best_mcc

In [None]:
np.set_printoptions(suppress=True)
import gc
# take 400,000 random samples with Response == 0
total = traindata[traindata['Response']==0].sample(frac=1).head(400000) 
# then combine them with the 6879 samples with Response == 1 and shuffle the whole set
total = pd.concat([total,traindata[traindata['Response']==1]]).sample(frac=1)
total

In [None]:
from sklearn.model_selection import train_test_split
X,y = total.drop(['Response','Id'],axis=1),total['Response']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42,stratify=y)

In [None]:
print(X_train.shape,X_test.shape,y_train.shape,y_test.shape)

In [None]:
X_train

In [None]:
y_train

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score,precision_score,PrecisionRecallDisplay
from sklearn.metrics import confusion_matrix,classification_report,matthews_corrcoef

In [None]:
model = RandomForestClassifier(n_estimators=500,n_jobs=-1,verbose=1,random_state=11)
model.fit(X_train,y_train)
pred = model.predict(X_test)

In [None]:
print(classification_report(y_test,pred))  # y_true goes first, then the predictions
print(matthews_corrcoef(y_test,pred))
confusion_matrix(y_test,pred)

In [None]:
# plot_precision_recall_curve was removed in scikit-learn 1.2; PrecisionRecallDisplay replaces it
PrecisionRecallDisplay.from_estimator(model,X_test,y_test)