## 1. DATA MANIPULATIONS

### 1.1. In our case, two manipulations should be made to the datasets.

    1. In all of the datasets, there is one redundant value at the leftmost position of each row. These look like 
    some kind of 'id' field, but we don't need them in our case. So the first thing to do is to get rid of them.
    2. In datatest2, the double quotes (") around the 'date' column are missing. To use our data consistently, we 
    need to add them.

In [49]:
# First, read the txt file and collect every line of it into a list: lines.
with open('/home/emremrah/Desktop/DataMin/Project1/datatraining.txt', 'r') as file:
    lines = file.readlines()

# Copy the header row unchanged.
new_lines = [lines[0]]
# For every data row, split on commas, skip the leftmost field (the id), and keep the rest.
for line in lines[1:]:
    new_lines.append(','.join(line.split(',')[1:]))

# Then write the cleaned rows to a CSV file.
with open('/home/emremrah/Desktop/DataMin/Project1/datatraining.csv', 'w') as file:
    file.writelines(new_lines)
    
################################################################################
    
with open('/home/emremrah/Desktop/DataMin/Project1/datatest.txt', 'r') as file:
    lines = file.readlines()
    
new_lines = []
new_lines.append(lines[0])
for line in lines[1:]:
    new_lines.append(','.join(l for l in line.split(',')[1:]))
    
with open('/home/emremrah/Desktop/DataMin/Project1/datatest.csv', 'w') as file:
    file.writelines(new_lines)

################################################################################
    
with open('/home/emremrah/Desktop/DataMin/Project1/datatest2.txt', 'r') as file:
    lines = file.readlines()
    
new_lines = []
new_lines.append(lines[0])
# Here, we also add the missing quotes around the 'date' field.
for line in lines[1:]:
    i = line.index(',') + 1      # start of the date field (just after the id)
    i2 = line[i:].index(',')     # length of the date field
    line = line[:i] + '"' + line[i:i+i2] + '"' + line[i+i2:]
    new_lines.append(','.join(l for l in line.split(',')[1:]))
    
with open('/home/emremrah/Desktop/DataMin/Project1/datatest2.csv', 'w') as file:
    file.writelines(new_lines)
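
The three blocks above repeat the same steps, so they could be folded into one helper. The following is only a minimal sketch of that idea: the function name `strip_id_column`, its `quote_date` flag, and the sample rows are made up for illustration and are not part of the original notebook.

```python
def strip_id_column(lines, quote_date=False):
    """Drop the leading id field from every data row.

    If quote_date is True, also wrap the date field in double quotes.
    """
    cleaned = [lines[0]]  # the header row is copied unchanged
    for line in lines[1:]:
        # Split off the first field (the row id) and keep the rest.
        _, rest = line.split(',', 1)
        if quote_date:
            # The date is now the first field of the remaining text.
            date, tail = rest.split(',', 1)
            rest = '"{}",{}'.format(date, tail)
        cleaned.append(rest)
    return cleaned


# Made-up sample rows, shaped like an unquoted-date file.
sample = [
    '"date","Temperature","Occupancy"\n',
    '"1",2015-02-11 14:48:00,21.76,1\n',
]
result = strip_id_column(sample, quote_date=True)
print(result[1])
```

With a helper like this, each file would need only a read, one call, and a write.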

#### After this, we can load our data into pandas DataFrames.

In [64]:
import pandas as pd

path = "/home/emremrah/Desktop/DataMin/Project1/datatraining.csv"
train = pd.read_csv(path)
path = "/home/emremrah/Desktop/DataMin/Project1/datatest.csv"
test = pd.read_csv(path)
path = "/home/emremrah/Desktop/DataMin/Project1/datatest2.csv"
test2 = pd.read_csv(path)

#### Now, let's take a look at our data.

In [66]:
print("Training set:\n{}".format(train.head()))
# print("\nTest set #1:\n{}".format(test.head()))
# print("\nTest set #2:\n{}".format(test2.head()))
print("\nKeys of TRAINING data:\n{}".format(train.keys()))
print("\nKeys of TEST1 data:\n{}".format(test.keys()))
print("\nKeys of TEST2 data:\n{}".format(test2.keys()))

Training set:
                  date  Temperature  Humidity  Light     CO2  HumidityRatio  \
0  2015-02-04 17:51:00        23.18   27.2720  426.0  721.25       0.004793   
1  2015-02-04 17:51:59        23.15   27.2675  429.5  714.00       0.004783   
2  2015-02-04 17:53:00        23.15   27.2450  426.0  713.50       0.004779   
3  2015-02-04 17:54:00        23.15   27.2000  426.0  708.25       0.004772   
4  2015-02-04 17:55:00        23.10   27.2000  426.0  704.50       0.004757   

   Occupancy  
0          1  
1          1  
2          1  
3          1  
4          1  

Keys of TRAINING data:
Index(['date', 'Temperature', 'Humidity', 'Light', 'CO2', 'HumidityRatio',
       'Occupancy'],
      dtype='object')

Keys of TEST1 data:
Index(['date', 'Temperature', 'Humidity', 'Light', 'CO2', 'HumidityRatio',
       'Occupancy'],
      dtype='object')

Keys of TEST2 data:
Index(['date', 'Temperature', 'Humidity', 'Light', 'CO2', 'HumidityRatio',
       'Occupancy'],
      dtype='object')


#### As you can see, all of them have the same structure. Later, we are going to separate the 'Occupancy' column from each set: the remaining columns become the inputs (X) and 'Occupancy' becomes the output (y), so that we can first train and then test our models.
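
The same check can also be done programmatically instead of by eye. This is a small sketch on made-up frames; `same_columns` is a hypothetical helper, not something defined elsewhere in this notebook.

```python
import pandas as pd


def same_columns(*frames):
    """Return True if every DataFrame has the same column names in the same order."""
    first = list(frames[0].columns)
    return all(list(f.columns) == first for f in frames[1:])


# Made-up frames: a and b share a layout, c has its columns swapped.
a = pd.DataFrame({"date": [0], "Occupancy": [1]})
b = pd.DataFrame({"date": [1], "Occupancy": [0]})
c = pd.DataFrame({"Occupancy": [0], "date": [1]})

print(same_columns(a, b))
print(same_columns(a, c))
```

On the real data, the call would be `same_columns(train, test, test2)`.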

### 1.2. Next problem: the 'date' feature can't be processed by our algorithms in its current form.

    So, I want to turn the date into a 'weekday' feature. This way each day, from Monday to Sunday, corresponds
    to an integer from 0 to 6, which our algorithms can use.
    
    To do that, I use the datetime.strptime function to parse our preprocessed 'date' data and get its weekday.

In [67]:
import datetime

def dt(st):
    return datetime.datetime.strptime(st, '%Y-%m-%d %H:%M:%S').weekday()

# Replace every timestamp string with its weekday index (0 = Monday, 6 = Sunday).
for df in (train, test, test2):
    df['date'] = df['date'].apply(dt)
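
For larger frames, the same conversion can be done without calling a Python function per row. The sketch below uses `pd.to_datetime` and the `.dt.weekday` accessor on a made-up sample frame; it is an alternative to consider, not what the notebook actually runs.

```python
import pandas as pd

# Made-up sample frame; only the 'date' column matters here.
sample = pd.DataFrame({
    "date": ["2015-02-04 17:51:00", "2015-02-05 09:00:00", "2015-02-09 12:30:00"],
})

# Parse all strings at once, then read the weekday (0 = Monday, 6 = Sunday).
sample["date"] = pd.to_datetime(sample["date"], format="%Y-%m-%d %H:%M:%S").dt.weekday

print(sample["date"].tolist())
```

A side effect of this variant is that the column ends up with an integer dtype rather than holding generic Python objects.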

#### Now our 'date' column consists only of weekday numbers from 0 to 6.

In [68]:
print("Training set:\n{}".format(train.head()))

Training set:
   date  Temperature  Humidity  Light     CO2  HumidityRatio  Occupancy
0     2        23.18   27.2720  426.0  721.25       0.004793          1
1     2        23.15   27.2675  429.5  714.00       0.004783          1
2     2        23.15   27.2450  426.0  713.50       0.004779          1
3     2        23.15   27.2000  426.0  708.25       0.004772          1
4     2        23.10   27.2000  426.0  704.50       0.004757          1


## 2. A CLOSER LOOK AT THE DATASET

#### Now that our data is ready to be processed, we will take a closer look to understand what we have.

### 2.1. Understanding the data

#### Let's take a look at the shape of our data:

In [69]:
print("\nTrain shape: {}".format(train.shape))
print("\nTest1 shape: {}".format(test.shape))
print("\nTest2 shape: {}".format(test2.shape))


Train shape: (8143, 7)

Test1 shape: (2665, 7)

Test2 shape: (9752, 7)


#### As seen, since our data has already been split into training and test sets, we don't need to split it again.
    
    One thing still has to be done:
    
    We are going to split every set into X and y (input and output).
    
    'Occupancy' isn't a feature; it is the output that we want to predict. So we must separate it from the X (input)
    sets and keep it only in the y (output) sets. Otherwise, the models would treat 'Occupancy' as a feature.
    

In [70]:
# Save the target column of each set before removing it.
target = train["Occupancy"].values
target2 = test["Occupancy"].values
target3 = test2["Occupancy"].values

# No copy is made here: xtrain and train refer to the same DataFrame.
xtrain = train
del xtrain["Occupancy"]
ytrain = target

xtest = test
del xtest["Occupancy"]
ytest = target2

xtest2 = test2
del xtest2["Occupancy"]
ytest2 = target3

    And we are done. Now every X set has the 6 features without 'Occupancy', and every y set holds only 'Occupancy'.
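
One thing to be aware of: `xtrain = train` binds a second name to the same DataFrame rather than copying it, so `del xtrain["Occupancy"]` also removes the column from `train`. If the original frames need to stay intact, a non-mutating variant is possible. Below is a minimal sketch on made-up data using `DataFrame.drop`; the names `sample`, `x_sample` and `y_sample` are illustrative only.

```python
import pandas as pd

# Made-up sample frame with the same kind of layout as the real sets.
sample = pd.DataFrame({
    "Temperature": [23.18, 23.15, 21.0],
    "Light": [426.0, 429.5, 0.0],
    "Occupancy": [1, 1, 0],
})

# drop() returns a new DataFrame, so `sample` keeps its 'Occupancy' column.
x_sample = sample.drop(columns=["Occupancy"])
y_sample = sample["Occupancy"].values

print(x_sample.shape, y_sample.shape)
```

Whether this matters depends on whether the full frames are needed again later in the analysis.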

In [77]:
print("xtrain shape: {}\n".format(xtrain.shape))
print("ytrain shape: {}\n".format(ytrain.shape))
print("ytrain: {}\n".format(ytrain))

xtrain shape: (8143, 6)

ytrain shape: (8143,)

ytrain: [1 1 1 ..., 1 1 1]

