In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

A good chocolate soufflé is decadent, delicious, and delicate. But, it's a challenge to prepare. When you pull a disappointingly deflated dessert out of the oven, you instinctively retrace your steps to identify at what point you went wrong. Bosch, one of the world's leading manufacturing companies, has an imperative to ensure that the recipes for the production of its advanced mechanical components are of the highest quality and safety standards. Part of doing so is closely monitoring its parts as they progress through the manufacturing processes.

Because Bosch records data at every step along its assembly lines, they have the ability to apply advanced analytics to improve these manufacturing processes. However, the intricacies of the data and complexities of the production line pose problems for current methods.

In this competition, Bosch is challenging Kagglers to predict internal failures using thousands of measurements and tests made for each component along the assembly line. This would enable Bosch to bring quality products at lower costs to the end user.

Submissions are evaluated on the Matthews correlation coefficient (MCC) between the predicted and the observed response. The MCC is given by:

$$\mathrm{MCC} = \frac{(TP \cdot TN) - (FP \cdot FN)}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}},$$


where TP is the number of true positives, TN the number of true negatives, FP the number of false positives, and FN the number of false negatives.
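
As a minimal sketch of the formula above, the MCC can be computed directly from the four confusion-matrix counts. The function name `mcc_from_counts` and the choice to return 0.0 when the denominator is zero are assumptions made for illustration; they are not part of the competition's evaluation code.

```python
import math


def mcc_from_counts(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: an undefined MCC (zero denominator) is reported as 0.0
    return numerator / denominator if denominator else 0.0


# Toy counts, chosen only to exercise the function
print(mcc_from_counts(tp=5, tn=90, fp=3, fn=2))
```

Because the ground truth is highly imbalanced, a classifier that predicts only the majority class gets TP = FP = 0, so the denominator vanishes and the score is 0 under this convention.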

Data Description

The data for this competition represents measurements of parts as they move through Bosch's production lines. Each part has a unique Id. The goal is to predict which parts will fail quality control (represented by a 'Response' = 1).

The dataset contains an extremely large number of anonymized features. Features are named according to a convention that tells you the production line, the station on the line, and a feature number. E.g. L3_S36_F3939 is a feature measured on line 3, station 36, and is feature number 3939.

On account of the large size of the dataset, we have separated the files by the type of feature they contain: numerical, categorical, and finally, a file with date features. The date features provide a timestamp for when each measurement was taken. Each date column ends in a number that corresponds to the previous feature number. E.g. the value of L0_S0_D1 is the time at which L0_S0_F0 was taken.
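
The naming convention described above can be unpacked with a small parser. This is only a sketch: the helper `parse_feature_name`, the regular expression, and the `kind` field (`F` for a measured feature, `D` for a date column) are illustrative names that do not come from the competition material.

```python
import re

# L<line>_S<station>_<F|D><number>, e.g. L3_S36_F3939 or L0_S0_D1
FEATURE_PATTERN = re.compile(r"^L(\d+)_S(\d+)_([FD])(\d+)$")


def parse_feature_name(name):
    """Split a column name into line, station, kind and feature number."""
    match = FEATURE_PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a feature column: {name!r}")
    line, station, kind, number = match.groups()
    return {
        "line": int(line),
        "station": int(station),
        "kind": kind,
        "number": int(number),
    }


print(parse_feature_name("L3_S36_F3939"))
print(parse_feature_name("L0_S0_D1"))
```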

In addition to being one of the largest datasets (in terms of number of features) ever hosted on Kaggle, the ground truth for this competition is highly imbalanced. Together, these two attributes are expected to make this a challenging problem.

File descriptions

- train_numeric.csv - the training set numeric features (this file contains the 'Response' variable)
- test_numeric.csv - the test set numeric features (you must predict the 'Response' for these Ids)
- train_categorical.csv - the training set categorical features
- test_categorical.csv - the test set categorical features
- train_date.csv - the training set date features
- test_date.csv - the test set date features
- sample_submission.csv - a sample submission file in the correct format

Dataset explanation:
1. Each line has a number of stations, and station IDs are unique across lines, so it is enough to know which station a part is at; the line can then be derived from the station.
2. As a part passes through a station, the corresponding features are measured at a specific time, so the data can be organized by Id (part) and time (measurement time). For example, L3_S36_F3939 with time 87.2 means that at time 87.2 the part was on line 3, station 36, where feature 3939 was measured.
3. At a single station a part has many different features measured, and the measurement time is the same for all features of that part at that station.
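
The three points above amount to reshaping the wide date table into one row per (part, measurement). A minimal sketch on a toy frame, assuming `pandas.melt` as the reshaping step (the cells below use an explicit loop instead); the values and the names `wide` and `long_df` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy wide table: one row per part, one column per date feature (values are made up)
wide = pd.DataFrame({
    "Id": [4, 6],
    "L0_S0_D1": [82.24, np.nan],
    "L0_S0_D3": [82.24, np.nan],
    "L3_S36_D3941": [np.nan, 87.2],
})

# One row per (part, date column), keeping only measurements that exist
long_df = wide.melt(id_vars="Id", var_name="feature", value_name="time")
long_df = long_df.dropna(subset=["time"])

# Split the column name into line, station and feature number
parts = long_df["feature"].str.extract(r"L(\d+)_S(\d+)_D(\d+)").astype("int64")
parts.columns = ["line", "station", "feature_number"]
long_df = pd.concat([long_df[["Id", "time"]], parts], axis=1)
long_df = long_df.sort_values(["Id", "station"]).reset_index(drop=True)

print(long_df)
```

Part 4 ends up with two rows at station 0 sharing the same time, which is the situation described in point 3.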

In [None]:
train_date_part = pd.read_csv('../input/bosch-production-line-performance/train_date.csv.zip', nrows=10000)
print(train_date_part.shape)
train_date_part

In [None]:
# Let's check the min and max times for each station
# Split out each column, drop its NaN rows, then concatenate everything back together
    
dates=train_date_part.copy()
withId=True
times = []
cols = list(dates.columns)
if 'Id' in cols:
    cols.remove('Id')
for feature_name in cols:
    if withId:
        df = dates[['Id', feature_name]].copy()
        df.columns = ['Id', 'time']
    else:
        df = dates[[feature_name]].copy()
        df.columns = ['time']
    df['line'] = feature_name.split('_')[0][1:]
    df['station'] = feature_name.split('_')[1][1:]
    df['feature_number'] = feature_name.split('_')[2][1:]
    df = df.dropna()
    print(df.shape)
    times.append(df)
    print(len(times))
station_times=pd.concat(times)
station_times

In [None]:
station_times=station_times.sort_values(by=['Id','station'])
station_times['line']=station_times['line'].astype('int64')
station_times['station']=station_times['station'].astype('int64')
station_times['feature_number']=station_times['feature_number'].astype('int64')
print(station_times.dtypes)
station_times

In [None]:
# Which stations belong to each line?
set(station_times[station_times.line==0].station)

In [None]:
# Which stations recorded a measurement at a given time?
print(set(station_times[station_times.time==82.24].station))
# Which parts were measured at that time?
print(set(station_times[station_times.time==82.24].Id))

In [None]:
time=82.24
print('time: ',time)
station_=set(station_times[station_times.time==time].station)
for j in station_:
    print('station:', j)
    print('part: ',set(station_times[(station_times.time==time) & (station_times.station==j)].Id))
    print('feature_number: ',set(station_times[(station_times.time==time) & (station_times.station==j)].feature_number))
        

In [None]:
time=1379.78
print('time: ',time)
# Use the time selected above rather than the hard-coded 82.24 from the previous cell
station_=set(station_times[station_times.time==time].station)
for j in station_:
    print('station:', j)
    print('part: ',set(station_times[(station_times.time==time) & (station_times.station==j)].Id))
    print('feature_number: ',set(station_times[(station_times.time==time) & (station_times.station==j)].feature_number))
        

In [None]:
set(station_times[station_times.line==1].station)

In [None]:
set(station_times[station_times.line==2].station)

In [None]:
set(station_times[station_times.line==3].station)

Station IDs are unique across lines, so there is no need to include the line here.
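
One way to check this claim is to count the distinct lines per station; if every station maps to exactly one line, the station alone identifies the line. The sketch below uses a toy frame with made-up rows in place of `station_times`, and the names `toy`, `lines_per_station` and `station_to_line` are illustrative.

```python
import pandas as pd

# Toy stand-in for the (line, station) columns of station_times
toy = pd.DataFrame({
    "line": [0, 0, 1, 3, 3],
    "station": [0, 1, 24, 36, 36],
})

# Number of distinct lines each station appears on
lines_per_station = toy.groupby("station")["line"].nunique()
print((lines_per_station == 1).all())

# When the check passes, a station -> line lookup recovers the line
station_to_line = toy.drop_duplicates("station").set_index("station")["line"]
print(station_to_line)
```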

In [None]:
station_times.line.value_counts()

In [None]:
set(station_times.Id)

In [None]:
station_times.station.value_counts()

In [None]:
station_times.feature_number.value_counts()

In [None]:
part_id=6
part_filter=station_times[station_times.Id==part_id]
part_filter_line=set(part_filter.line)
part_filter_station=set(part_filter.station)
print('total_line: ',part_filter_line)
print('total_station: ',part_filter_station)

for i in part_filter_line:
    print('line:', i)
    for j in part_filter_station:
        print('station:', j)
        print('feature_number: ',set(part_filter[(part_filter.line==i) & (part_filter.station==j)].feature_number))
        print('time: ',set(part_filter[(part_filter.line==i) & (part_filter.station==j)].time))
        

In [None]:
def part_info(part_id):
    #part_id=part_id
    part_filter=station_times[station_times.Id==part_id]
    part_filter_line=set(part_filter.line)
    part_filter_station=set(part_filter.station)
    print('total_line: ',part_filter_line)
    print('total_station: ',part_filter_station)
    print('-'*60)

    for i in part_filter_line:
        print('-'*10)
        print('line:', i)
        for j in part_filter_station:
            print('station:', j)
            print('feature_number: ',set(part_filter[(part_filter.line==i) & (part_filter.station==j)].feature_number))
            print('time: ',set(part_filter[(part_filter.line==i) & (part_filter.station==j)].time))
        

In [None]:
part_info(120)

In [None]:
min_station_times = station_times.groupby(['Id', 'station'])['time'].min()
max_station_times = station_times.groupby(['Id', 'station'])['time'].max()

In [None]:
min_station_times

In [None]:
max_station_times

In [None]:
train_date_part

In [None]:
train_date_part.count()

In [None]:
# Count the non-null values in each date column, most populated first
date_cols = train_date_part.drop('Id', axis=1).count().reset_index().\
            sort_values(by=0, ascending=False)
date_cols

In [None]:
date_cols['station'] = date_cols['index'].apply(lambda s: s.split('_')[1])
date_cols

In [None]:
date_cols = date_cols.drop_duplicates('station', keep='first')['index'].tolist()
date_cols # selected features
# Drop duplicate stations: each station has several date columns (one per measured feature),
# so keep only the most populated date column for each station

In [None]:
# Apply the selected columns to the full training set
train_date = pd.read_csv('../input/bosch-production-line-performance/train_date.csv.zip', usecols=date_cols)
print(train_date.shape)
train_date

In [None]:
dates=train_date.copy()
withId=False
times = []
cols = list(dates.columns)
if 'Id' in cols:
    cols.remove('Id')
for feature_name in cols:
    if withId:
        df = dates[['Id', feature_name]].copy()
        df.columns = ['Id', 'time']
    else:
        df = dates[[feature_name]].copy()
        df.columns = ['time']
    df['line'] = feature_name.split('_')[0][1:]
    df['station'] = feature_name.split('_')[1][1:]
    df['feature_number'] = feature_name.split('_')[2][1:]
    df = df.dropna()
    #print(df.shape)
    times.append(df)
    #print(len(times))
train_station_times=pd.concat(times)
print(train_station_times.shape)
train_station_times
# Because only 52 columns are kept, the total is about 14 million rows, which is manageable;
# keeping 1000 columns would make this number very large

In [None]:
train_time_cnt = train_station_times.groupby('time').count()[['station']].reset_index()
train_time_cnt.columns = ['time', 'cnt']
print(train_time_cnt.shape)
train_time_cnt
# For each measurement time, count how many station records correspond to it.

In [None]:
test_date = pd.read_csv('../input/bosch-production-line-performance/test_date.csv.zip', usecols=date_cols)
print(test_date.shape)
test_date

In [None]:
dates=test_date.copy()
withId=False
times = []
cols = list(dates.columns)
if 'Id' in cols:
    cols.remove('Id')
for feature_name in cols:
    if withId:
        df = dates[['Id', feature_name]].copy()
        df.columns = ['Id', 'time']
    else:
        df = dates[[feature_name]].copy()
        df.columns = ['time']
    df['line'] = feature_name.split('_')[0][1:]
    df['station'] = feature_name.split('_')[1][1:]
    df['feature_number'] = feature_name.split('_')[2][1:]
    df = df.dropna()
    #print(df.shape)
    times.append(df)
    #print(len(times))
test_station_times=pd.concat(times)
print(test_station_times.shape)
test_station_times

In [None]:
test_time_cnt = test_station_times.groupby('time').count()[['station']].reset_index()
test_time_cnt.columns = ['time', 'cnt']
print(test_time_cnt.shape)

In [None]:
import matplotlib.pyplot as plt
fig = plt.figure()
plt.plot(train_time_cnt['time'].values, train_time_cnt['cnt'].values, 'b.', alpha=0.1, label='train')
plt.plot(test_time_cnt['time'].values, test_time_cnt['cnt'].values, 'r.', alpha=0.1, label='test')
plt.title('Original date values')
plt.ylabel('Number of records')
plt.xlabel('Time')
plt.legend(loc=0)  # show the train/test labels set in the plot calls above
fig.savefig('original_date_values.png', dpi=300)
plt.show()

In [None]:
print((train_time_cnt['time'].min(), train_time_cnt['time'].max()))
print((test_time_cnt['time'].min(), test_time_cnt['time'].max()))

A few observations:

- The train and test sets cover the same time period
- There is a clear periodic pattern
- The dates are transformed to the range 0 - 1718 with a granularity of 0.01
- There is a gap in the middle

Can we figure out what 0.01 means? Let's check a few autocorrelations!
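
As a quick illustration of the idea, the lag-k autocorrelation used here is the Pearson correlation between a series and a copy of itself shifted by k steps. A minimal sketch on a synthetic signal; the helper `autocorr`, the period of 50 steps and the noise level are assumptions chosen for illustration, not properties of the Bosch data.

```python
import numpy as np


def autocorr(x, lag):
    """Pearson correlation between x and x shifted by `lag` steps."""
    return np.corrcoef(x[:-lag], x[lag:])[0, 1]


rng = np.random.default_rng(0)
period = 50  # hypothetical period, in steps
t = np.arange(2000)
signal = np.sin(2 * np.pi * t / period) + 0.1 * rng.standard_normal(t.size)

# Strongly positive at a full period, strongly negative at half a period
print(autocorr(signal, period))
print(autocorr(signal, period // 2))
```

A periodic series therefore shows autocorrelation peaks at multiples of its period, which is the pattern the following cells look for in the record counts.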

In [None]:
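# Round to 2 decimals so the generated ticks compare equal to the 0.01-granular time values when merged
time_ticks = np.round(np.arange(train_time_cnt['time'].min(), train_time_cnt['time'].max() + 0.01, 0.01), 2)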
time_ticks = pd.DataFrame({'time': time_ticks})
time_ticks

In [None]:
time_ticks = pd.merge(time_ticks, train_time_cnt, how='left', on='time')
time_ticks = time_ticks.fillna(0)
time_ticks
# Count how many station records are associated with each specific time across the whole dataset

In [None]:
# Autocorrelation
x = time_ticks['cnt'].values
max_lag = 8000
auto_corr_ks = range(1, max_lag)
auto_corr = np.array([1] + [np.corrcoef(x[:-k], x[k:])[0, 1] for k in auto_corr_ks])
auto_corr

In [None]:
print(len(auto_corr_ks))
print(auto_corr_ks)
print(x)

In [None]:
k=1
print('k',k)
print(len(x[:-k]))
print(len(x[k:]))
print(x[:-k])
print(x[k:])
print('corrcoef: \n',np.corrcoef(x[:-k], x[k:]))
print('corrcoef: \n',np.corrcoef(x[:-k], x[k:])[0,1])
print('corrcoef: \n',np.array([1] + [np.corrcoef(x[:-k], x[k:])[0,1]]))  # prepend 1 (lag 0), do not add it

In [None]:
k=3
print('k',k)
print(len(x[:-k]))
print(len(x[k:]))
print(x[:-k])
print(x[k:])
print('corrcoef: \n',np.corrcoef(x[:-k], x[k:]))
print('corrcoef: \n',np.corrcoef(x[:-k], x[k:])[0,1])
print('corrcoef: \n',np.array([1] + [np.corrcoef(x[:-k], x[k:])[0,1]]))  # prepend 1 (lag 0), do not add it

In [None]:
fig = plt.figure()
plt.plot(auto_corr, 'k.', label='autocorrelation by 0.01')
plt.title('Train Sensor Time Auto-correlation')
period = 25
auto_corr_ks = list(range(period, max_lag, period))
auto_corr = np.array([1] + [np.corrcoef(x[:-k], x[k:])[0, 1] for k in auto_corr_ks])
plt.plot([0] + auto_corr_ks, auto_corr, 'go', alpha=0.5, label='strange autocorrelation at 0.25')
period = 1675
auto_corr_ks = list(range(period, max_lag, period))
auto_corr = np.array([1] + [np.corrcoef(x[:-k], x[k:])[0, 1] for k in auto_corr_ks])
plt.plot([0] + auto_corr_ks, auto_corr, 'ro', markersize=10, alpha=0.5, label='one week = 16.75?')
plt.xlabel('k * 0.01 -  autocorrelation lag')
plt.ylabel('autocorrelation')
plt.legend(loc=0)
#fig.savefig('train_time_auto_correlation.png', dpi=300)

The largest peaks are at approximately 1680 ticks. Let's call it a week ;)

Within each week we can see 7 local maxima, roughly one per day.
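
If the period of roughly 1680 ticks really is one week, the size of a single 0.01 step follows by simple arithmetic. The sketch below works under that assumption only; the constant names are illustrative, and the peak position is approximate (the plot above marks it at 1675 ticks).

```python
TICKS_PER_WEEK = 1680            # approximate peak lag, in steps of 0.01
MINUTES_PER_WEEK = 7 * 24 * 60   # minutes in a calendar week

# Length of one 0.01 step, assuming the period is exactly one week
minutes_per_tick = MINUTES_PER_WEEK / TICKS_PER_WEEK

# Number of 0.01 steps per day under the same assumption
ticks_per_day = TICKS_PER_WEEK / 7

print(minutes_per_tick)
print(ticks_per_day)
```

Under this reading one 0.01 step is about 6 minutes and one day spans about 2.4 time units, which is consistent with seeing 7 local maxima inside each weekly period.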