# Predicting the occupancies of Belgian trains

## 0. Tips and tricks for Jupyter notebooks

* Export your notebooks locally: File -> Download as -> html/ipynb (do both!) (NOTE: don't trust the virtual environment, always download your latest version locally!)

* Shift + Enter to run a cell (or play button)

* TAB for code completion!

* Cells can be Code but also Markdown! (dropdown menu on top allows you to choose)

* Cell -> Run All might come in useful

* In case of problems you can always Kernel -> Restart

* In the Jupyter start window you have a 'new' button on the upper right => to open a terminal, a new python3 notebook, etc.

* You cannot upload very large files (>3GB) via the upload button in Jupyter (bug). The datasets will always be available via a public Amazon URL, so use wget or scp to get your data into /mnt on the virtual wall!


## 1. Data Preparation

### Essential Libraries for Python Data Science

In [1]:
#vector/matrix library
import numpy as np
#data frame library (similar to R)
import pandas as pd

#visualization library
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')

#regular expression library for data cleansing
import re

### Detour: Get up to speed with python / pandas

* If you are new to python have a look at the practicelab.ipynb in the github repo
* If you are new to pandas take some time for:
    - <a href="http://pandas.pydata.org/pandas-docs/stable/10min.html"> Pandas basics in 10 minutes </a>
    - <a href="http://pandas.pydata.org/pandas-docs/stable/visualization.html"> Pandas for data visualizations </a>
    - Use the pandas cheat sheet in the github repo
    
    
#### TIP: Appending a row to a pandas dataframe copies the old df into a new one, so building a df row by row is an O(N^2) operation (I learnt this the hard way)
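
To make the tip concrete, here is a minimal sketch of the cheaper pattern: collect the rows in a plain Python list and build the frame once at the end. The column names and values are made up for illustration only.

```python
import pandas as pd

# Hypothetical rows, only for illustration
rows = []
for i in range(5):
    # Appending to a Python list is cheap (amortized constant time)
    rows.append({'id': i, 'square': i * i})

# Build the DataFrame in a single pass instead of appending row by row
df = pd.DataFrame(rows)
print(df.shape)
```

The same idea applies when parsing a file record by record: accumulate dicts or tuples first, then call the DataFrame constructor once.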
 


### 1a. Dataset characteristics

1. Read the train and test datasets
2. How many records and how many features are in train and test sets?
3. What is the range of the querytime column (earliest + latest date)?
4. Merge training and test data (data cleansing will be the same for both)
5. Drop the columns querytype and user_agent since they contain no useful info; this will speed up future df calculations

In [2]:
#1
path_train = './data/training_data.nldjson'
path_test  = './data/test.nldjson' 

In [3]:
#warning: the input files have slightly different formats: the training data is newline-delimited JSON,
#the test data is regular JSON
df_train = pd.read_json(path_train, lines=True)
df_test = pd.read_json(path_test)

#to have a look at a dataframe just use head or tail
df_train.head(n=5)

Unnamed: 0,post,querytime,querytype,user_agent
0,{'connection': 'http://irail.be/connections/88...,2016-07-27T20:05:51+02:00,occupancy,Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53...
1,{'connection': 'http://irail.be/connections/88...,2016-07-27T20:06:11+02:00,occupancy,Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53...
2,{'connection': 'http://irail.be/connections/88...,2016-07-27T20:08:57+02:00,occupancy,Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53...
3,{'connection': 'http://irail.be/connections/88...,2016-07-27T20:09:08+02:00,occupancy,Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53...
4,{'connection': 'http://irail.be/connections/88...,2016-07-27T20:11:01+02:00,occupancy,Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53...
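
The difference between the two input formats can be checked on a toy example. This is a minimal sketch with made-up records held in memory instead of the real files; only the `lines=True` flag differs between the two calls.

```python
import io
import pandas as pd

# Two tiny made-up records, serialized in both formats
ndjson_text = '{"a": 1, "b": "x"}\n{"a": 2, "b": "y"}\n'
json_text = '[{"a": 1, "b": "x"}, {"a": 2, "b": "y"}]'

# Newline-delimited JSON: one object per line, needs lines=True
df_lines = pd.read_json(io.StringIO(ndjson_text), lines=True)

# Regular JSON: a single array of objects
df_array = pd.read_json(io.StringIO(json_text))

print(df_lines)
print(df_array)
```

Reading a newline-delimited file without `lines=True` typically fails with a parsing error, which is a quick way to tell the two formats apart.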


In [None]:
df_test.head(n=5)

In [4]:
#2
print("Number of records in training set: " + str(df_train.shape[0]))
print("Number of records in test set: " + str(df_test.shape[0]))


Number of records in training set: 5062
Number of records in test set: 493


In [5]:

print("Features training data: " + str(list(df_train.columns)))
print("Features test data: " + str(list(df_test.columns)))

Features training data: ['post', 'querytime', 'querytype', 'user_agent']
Features test data: ['id', 'post', 'querytime', 'querytype', 'user_agent']


In [None]:
#3
print('Querytime min (training): ' + str(df_train['querytime'].min()))
print('Querytime max (training): ' + str(df_train['querytime'].max()))
print()
print('Querytime min (test): ' + str(df_test['querytime'].min()))
print('Querytime max (test): ' + str(df_test['querytime'].max()))

In [6]:
#4
#merge train and test data AND reset the index
dataset_v1 = pd.concat([df_train, df_test]).reset_index(drop=True)    
dataset_v1.head(n=5)

Unnamed: 0,id,post,querytime,querytype,user_agent
0,,{'connection': 'http://irail.be/connections/88...,2016-07-27T20:05:51+02:00,occupancy,Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53...
1,,{'connection': 'http://irail.be/connections/88...,2016-07-27T20:06:11+02:00,occupancy,Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53...
2,,{'connection': 'http://irail.be/connections/88...,2016-07-27T20:08:57+02:00,occupancy,Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53...
3,,{'connection': 'http://irail.be/connections/88...,2016-07-27T20:09:08+02:00,occupancy,Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53...
4,,{'connection': 'http://irail.be/connections/88...,2016-07-27T20:11:01+02:00,occupancy,Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53...


In [7]:
#5
dataset_v2 = dataset_v1.drop(['querytype','user_agent'], axis=1)
dataset_v2.head(n=5)

Unnamed: 0,id,post,querytime
0,,{'connection': 'http://irail.be/connections/88...,2016-07-27T20:05:51+02:00
1,,{'connection': 'http://irail.be/connections/88...,2016-07-27T20:06:11+02:00
2,,{'connection': 'http://irail.be/connections/88...,2016-07-27T20:08:57+02:00
3,,{'connection': 'http://irail.be/connections/88...,2016-07-27T20:09:08+02:00
4,,{'connection': 'http://irail.be/connections/88...,2016-07-27T20:11:01+02:00


### 1b. Parsing the columns

1. In the id column replace the NaNs with -1
2. Select a row by id and have a closer look at the post column (json object)
    * check multiple rows since the post object doesn't always have the same fields
3. Iterate over the frame while keeping track of the possible fields in post => what is the set of possible fields?
4. Write a function to extract a field of the json object 
5. For every field in the json objects create an additional column, finally drop the post column
6. The querytime column is now interpreted as a string => convert to datetime; the id column can be cast to integer
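
As an alternative to writing the extraction by hand (steps 3 to 5), pandas can expand a column of dicts in one call. This is only a sketch, assuming pandas 1.0 or later where `pd.json_normalize` is available at the top level; the records below are made up and merely mimic the shape of the post column.

```python
import pandas as pd

# Made-up records mimicking the structure of the post column;
# the second record lacks the 'to' field on purpose
posts = pd.Series([
    {'connection': 'c1', 'from': 's1', 'to': 's2'},
    {'connection': 'c2', 'from': 's3'},
])

# json_normalize builds one column per key; missing keys become NaN
expanded = pd.json_normalize(posts.tolist())
print(expanded)
```

The result could then be joined back to the original frame on the index. The manual approach used in this notebook remains useful for seeing what the fields look like along the way.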



    


In [8]:
#1
#fillna replaces every NaN in the id column with -1
dataset_v2['id'] = dataset_v2['id'].fillna(-1)
dataset_v2.head(n=3)

Unnamed: 0,id,post,querytime
0,-1.0,{'connection': 'http://irail.be/connections/88...,2016-07-27T20:05:51+02:00
1,-1.0,{'connection': 'http://irail.be/connections/88...,2016-07-27T20:06:11+02:00
2,-1.0,{'connection': 'http://irail.be/connections/88...,2016-07-27T20:08:57+02:00


In [9]:
#2
row = dataset_v2.iloc[2]
post_obj = row['post']
post_obj

{'connection': 'http://irail.be/connections/8813003/20160727/IC1518',
 'date': '7000',
 'from': 'http://irail.be/stations/NMBS/008813003',
 'occupancy': 'http://api.irail.be/terms/high',
 'to': 'http://irail.be/stations/NMBS/00',
 'vehicle': 'http://irail.be/vehicle/IC1518'}

In [10]:
#3
all_keys = []
for r in dataset_v2.iterrows():
    new_keys = list(r[1]['post'].keys())
    all_keys.extend(new_keys)
    
unique_keys = set(all_keys)
unique_keys

{'connection', 'date', 'from', 'occupancy', 'to', 'vehicle'}

In [12]:
#4
def extract_field(post, field):
    #dict.get returns None when the field is missing
    return post.get(field)

In [13]:
#test
dataset_v2['post'].apply(lambda j: extract_field(j, 'connection')).head()

0    http://irail.be/connections/8813003/20160727/I...
1    http://irail.be/connections/8813003/20160727/I...
2    http://irail.be/connections/8813003/20160727/I...
3    http://irail.be/connections/8813003/20160727/I...
4    http://irail.be/connections/8813003/20160727/I...
Name: post, dtype: object

In [14]:
#5
for k in unique_keys:
    new_col = dataset_v2['post'].apply(lambda j: extract_field(j, k))
    dataset_v2[k] = new_col
    
dataset_v3 = dataset_v2.drop('post', axis=1)    
dataset_v3.head()    

Unnamed: 0,id,querytime,date,vehicle,connection,to,occupancy,from
0,-1.0,2016-07-27T20:05:51+02:00,Sun Jan 18 1970 01:14:03 GMT+0100 (CET),http://irail.be/vehicle/IC1518,http://irail.be/connections/8813003/20160727/I...,,http://api.irail.be/terms/high,http://irail.be/stations/NMBS/008813003
1,-1.0,2016-07-27T20:06:11+02:00,Sun Jan 18 1970 01:14:03 GMT+0100 (CET),http://irail.be/vehicle/IC1518,http://irail.be/connections/8813003/20160727/I...,,http://api.irail.be/terms/high,http://irail.be/stations/NMBS/008813003
2,-1.0,2016-07-27T20:08:57+02:00,7000,http://irail.be/vehicle/IC1518,http://irail.be/connections/8813003/20160727/I...,http://irail.be/stations/NMBS/00,http://api.irail.be/terms/high,http://irail.be/stations/NMBS/008813003
3,-1.0,2016-07-27T20:09:08+02:00,7000,http://irail.be/vehicle/IC1518,http://irail.be/connections/8813003/20160727/I...,http://irail.be/stations/NMBS/00,http://api.irail.be/terms/high,http://irail.be/stations/NMBS/008813003
4,-1.0,2016-07-27T20:11:01+02:00,11663,http://irail.be/vehicle/IC1518,http://irail.be/connections/8813003/20160727/I...,http://irail.be/stations/NMBS/00,http://api.irail.be/terms/high,http://irail.be/stations/NMBS/008813003


In [15]:
#6
dataset_v3['querytime'] = pd.to_datetime(dataset_v3['querytime'])

#Avoid the distinction between winter and summer time later on by not converting to UTC
#dataset_v3['querytime'] = pd.to_datetime(dataset_v3['querytime'].apply(lambda t: t.split("+")[0]))

dataset_v3['id'] = pd.to_numeric(dataset_v3['id'], downcast='integer')


dataset_v3.dtypes

id                     int16
querytime     datetime64[ns]
date                  object
vehicle               object
connection            object
to                    object
occupancy             object
from                  object
dtype: object
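
If the local hour of the query matters later on, one option is to keep the timestamps timezone-aware rather than cutting off the offset. The sketch below uses two made-up timestamps and assumes that 'Europe/Brussels' is the relevant zone and that timezone data is installed alongside pandas.

```python
import pandas as pd

# Two made-up timestamps: one in summer time (+02:00), one in winter time (+01:00)
raw = pd.Series(['2016-07-27T20:05:51+02:00', '2016-12-02T16:32:02+01:00'])

# Parse to UTC first so mixed offsets end up in a single dtype
utc = pd.to_datetime(raw, utc=True)

# Convert to Belgian local time; the hour now matches the original strings
local = utc.dt.tz_convert('Europe/Brussels')
print(utc.dt.hour.tolist())
print(local.dt.hour.tolist())
```

With this approach the winter/summer offset is handled by the timezone conversion instead of by string manipulation.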

### 1c. And some more cleansing

- TASK: go over each of the columns, further parse them and pay close attention to NULL values
- HINT: don't mindlessly drop rows with missing data, the dataset has some level of redundancy so pay close attention!


1. How many NULL values per column?
2. Rework the vehicle column:
    - use the following information on the slides and at: https://nl.wikipedia.org/wiki/Lijst_van_treincategorie%C3%ABn_in_Belgi%C3%AB
    - extract connection type, train series, sequence number and you can also infer the direction of the train
    - keep checking for NULLs, some can be disguised (HINT: use the value_counts() function to check the distribution of values in a column)

3. connection
4. date
5. occupancy
6. to, from columns

In [17]:
#1

pd.isnull(dataset_v3['vehicle'])

0       False
1       False
2       False
3       False
4       False
5       False
6       False
7       False
8       False
9       False
10      False
11      False
12      False
13      False
14      False
15      False
16      False
17      False
18      False
19      False
20      False
21      False
22      False
23      False
24      False
25      False
26      False
27      False
28      False
29      False
        ...  
5525    False
5526    False
5527    False
5528    False
5529    False
5530    False
5531    False
5532    False
5533    False
5534    False
5535    False
5536    False
5537    False
5538    False
5539    False
5540    False
5541    False
5542    False
5543    False
5544    False
5545    False
5546    False
5547    False
5548    False
5549    False
5550    False
5551    False
5552    False
5553    False
5554    False
Name: vehicle, Length: 5555, dtype: bool

In [18]:
for col in dataset_v3.columns:
    print(col + "\t"+ str(dataset_v3[pd.isnull(dataset_v3[col])].shape[0]))

id	0
querytime	0
date	0
vehicle	0
connection	0
to	12
occupancy	493
from	0
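
The per-column loop can also be written as a single expression. A minimal sketch on a made-up frame (the column names are borrowed from the dataset, the values are not):

```python
import pandas as pd

# Made-up frame with a few missing values
df = pd.DataFrame({
    'to': ['s1', None, 's2'],
    'occupancy': ['high', None, None],
    'from': ['s3', 's4', 's5'],
})

# isnull() gives a boolean frame; summing counts the True values per column
null_counts = df.isnull().sum()
print(null_counts)
```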


In [19]:
dataset_v3[pd.isnull(dataset_v3['to'])]

Unnamed: 0,id,querytime,date,vehicle,connection,to,occupancy,from
0,-1,2016-07-27 18:05:51,Sun Jan 18 1970 01:14:03 GMT+0100 (CET),http://irail.be/vehicle/IC1518,http://irail.be/connections/8813003/20160727/I...,,http://api.irail.be/terms/high,http://irail.be/stations/NMBS/008813003
1,-1,2016-07-27 18:06:11,Sun Jan 18 1970 01:14:03 GMT+0100 (CET),http://irail.be/vehicle/IC1518,http://irail.be/connections/8813003/20160727/I...,,http://api.irail.be/terms/high,http://irail.be/stations/NMBS/008813003
21,-1,2016-07-27 18:31:32,20160727,http://irail.be/vehicle/IC1518,http://irail.be/connections/8813003/20160727/I...,,http://api.irail.be/terms/medium,http://irail.be/stations/NMBS/008813003
22,-1,2016-07-27 18:31:55,20160727,http://irail.be/vehicle/IC1518,http://irail.be/connections/8813003/20160727/I...,,http://api.irail.be/terms/medium,http://irail.be/stations/NMBS/008813003
23,-1,2016-07-27 18:32:59,20160727,http://irail.be/vehicle/IC1518,http://irail.be/connections/8813003/20160727/I...,,http://api.irail.be/terms/medium,http://irail.be/stations/NMBS/008813003
24,-1,2016-07-27 18:35:51,20160727,http://irail.be/vehicle/IC1518,http://irail.be/connections/8813003/20160727/I...,,http://api.irail.be/terms/high,http://irail.be/stations/NMBS/008813003
25,-1,2016-07-27 18:35:58,20160727,http://irail.be/vehicle/IC1518,http://irail.be/connections/8813003/20160727/I...,,http://api.irail.be/terms/high,http://irail.be/stations/NMBS/008813003
26,-1,2016-07-27 18:36:02,20160727,http://irail.be/vehicle/IC1518,http://irail.be/connections/8813003/20160727/I...,,http://api.irail.be/terms/high,http://irail.be/stations/NMBS/008813003
27,-1,2016-07-27 18:36:05,20160727,http://irail.be/vehicle/IC1518,http://irail.be/connections/8813003/20160727/I...,,http://api.irail.be/terms/high,http://irail.be/stations/NMBS/008813003
28,-1,2016-07-27 18:36:15,20160727,http://irail.be/vehicle/IC1518,http://irail.be/connections/8813003/20160727/I...,,http://api.irail.be/terms/high,http://irail.be/stations/NMBS/008813003


In [22]:
#2
def remove_url_prefix(s):
    return s.split('/')[-1]

dataset_v3['vehicle'] = dataset_v3['vehicle'].apply(remove_url_prefix)
dataset_v3['vehicle'].value_counts()

(null)      951
IC1515       43
IC1518       36
IC429        33
IC407        31
P7305        30
IC1807       24
P7013        23
IC1530       23
IC539        23
IC716        22
S68574       22
S61555       22
IC1534       21
IC3631       20
IC1507       20
IC1839       19
IC437        19
IC1514       19
IC1509       18
P7444        18
IC3328       18
L557         17
IC408        17
IC515        17
IC4516       17
S11757       16
IC3639       16
IC3604       16
S83978       16
           ... 
IC3835        1
L2892         1
IC822         1
IC617         1
IC2917        1
S86589        1
IC717         1
L2579         1
THA9308       1
IC2422        1
S86576        1
S53364        1
EXT12420      1
IC3111        1
IC1711        1
L2764         1
IC2840        1
S206467       1
S83957        1
ICE11         1
IC419         1
IC1719        1
L5565         1
IC3037        1
IC5385        1
IC3844        1
P7489         1
IC2337        1
S53379        1
S23786        1
Name: vehicle, Length: 1

In [23]:
def remove_irregulars(s):
    if pd.isnull(s):
        return s
    else:
        if s=='(null)':
            return None
        else:
            return s

dataset_v3['vehicle'] = dataset_v3['vehicle'].apply(remove_irregulars)
dataset_v3['vehicle'].value_counts()
dataset_v3[pd.isnull(dataset_v3['vehicle'])].head(n=5)

Unnamed: 0,id,querytime,date,vehicle,connection,to,occupancy,from
3382,-1,2016-12-02 15:32:02,20161202,,http://irail.be/connections/8833001/20161202/I...,http://irail.be/stations/NMBS/008812005,http://api.irail.be/terms/medium,http://irail.be/stations/NMBS/008833001
3383,-1,2016-12-02 15:34:07,20161202,,http://irail.be/connections/8812005/20161202/I...,http://irail.be/stations/NMBS/008833001,http://api.irail.be/terms/high,http://irail.be/stations/NMBS/008812005
3568,-1,2016-12-09 06:22:34,20161209,,http://irail.be/connections/8814001/20161209/P...,http://irail.be/stations/NMBS/008813003,http://api.irail.be/terms/low,http://irail.be/stations/NMBS/008814001
3570,-1,2016-12-09 06:31:41,20161209,,http://irail.be/connections/8813003/20161209/P...,http://irail.be/stations/NMBS/008812005,http://api.irail.be/terms/low,http://irail.be/stations/NMBS/008813003
3571,-1,2016-12-09 06:31:48,20161209,,http://irail.be/connections/8812005/20161209/P...,http://irail.be/stations/NMBS/008811007,http://api.irail.be/terms/low,http://irail.be/stations/NMBS/008812005
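
Disguised NULLs such as the '(null)' string can also be removed without a helper function. This is a sketch on made-up values; the `placeholders` list is hypothetical and should be extended with whatever value_counts() reveals.

```python
import pandas as pd

# Made-up vehicle ids containing a placeholder string
vehicles = pd.Series(['IC1518', '(null)', 'P7305', '(null)'])

# Hypothetical list of strings that should count as missing
placeholders = ['(null)']

# mask() replaces the entries where the condition holds with NaN
cleaned = vehicles.mask(vehicles.isin(placeholders))
print(cleaned.isnull().sum())
```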


In [26]:
#INSIGHT: the vehicle ids are also present in the connection column => extract them from there

dataset_v3.iloc[0]['connection']

'http://irail.be/connections/8813003/20160727/IC1518'

In [27]:
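def extract_train_series_id(r):
    #sequence number within the series: the last two digits of the train number
    train_number = int(re.sub(r"[a-zA-Z]", "", r))
    return train_number % 100

def extract_train_series(r):
    #train series: the train number rounded down to a multiple of 100
    train_number = int(re.sub(r"[a-zA-Z]", "", r))
    return train_number // 100 * 100

def extract_train_direction(r):
    #sequence numbers alternate between 'away' and 'back' in blocks of 25
    series_id = extract_train_series_id(r)
    series_div = (series_id // 25) % 2
    return 'away' if series_div == 0 else 'back'

def extract_train_type(r):
    #train type: the letters that remain after removing all digits, e.g. 'IC'
    return re.sub(r"[0-9]", "", r.upper())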
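
The helpers above raise a ValueError on ids that contain no digits, such as '(null)'. A possible variant, sketched here with a hypothetical `parse_vehicle` helper, splits type and number with one regular expression and returns None for anything that does not fit the letters-then-digits pattern.

```python
import re

# Letters (train type) followed by digits (train number), e.g. 'IC1518'
VEHICLE_PATTERN = re.compile(r'^([A-Za-z]+)(\d+)$')

def parse_vehicle(vehicle):
    """Return (train_type, train_number), or None if the id does not match."""
    match = VEHICLE_PATTERN.match(vehicle)
    if match is None:
        return None
    return match.group(1).upper(), int(match.group(2))

train_type, number = parse_vehicle('IC1518')
series = number // 100 * 100   # same rounding as extract_train_series
sequence = number % 100        # same remainder as extract_train_series_id
print(train_type, series, sequence)
```

Whether unmatched ids should be dropped or inspected by hand is a judgment call; returning None at least keeps them visible as missing values.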

In [28]:
#3

In [29]:
dataset_v4 = dataset_v3.drop(['vehicle','date'],axis=1)
dataset_v4['vehicle'] = dataset_v4['connection'].apply(lambda r: r.split('/')[-1])
dataset_v4['vehicle'].value_counts()

IC1515     69
IC1518     48
IC407      35
IC429      35
IC539      34
P7305      33
S61555     27
IC1807     27
IC1534     26
S32266     25
IC716      25
S86567     24
IC1530     24
IC4516     23
P7013      23
S68574     22
IC507      21
IC1839     21
IC1507     21
S83978     21
IC437      20
IC3631     19
IC515      19
IC1509     19
P7444      19
IC1514     18
IC408      18
IC415      18
IC3328     18
IC529      17
           ..
IC9248      1
S23661      1
S83988      1
IC3013      1
IC929       1
IC2242      1
S11983      1
IC1743      1
IC3312      1
IC2317      1
IC420       1
IC2441      1
THA9351     1
IC628       1
IC12614     1
IC3209      1
P7091       1
L2887       1
IC1912      1
IC3408      1
L2485       1
S102077     1
L5593       1
S11763      1
S73460      1
L4882       1
IC2612      1
IC4340      1
IC3730      1
IC3240      1
Name: vehicle, Length: 1536, dtype: int64

In [30]:
dataset_v4['train_series']    = dataset_v4['vehicle'].apply(extract_train_series)
dataset_v4['train_direction'] = dataset_v4['vehicle'].apply(extract_train_direction)
dataset_v4['train_type']      = dataset_v4['vehicle'].apply(extract_train_type)


dataset_v5 = dataset_v4.drop(['vehicle','connection'], axis=1)

In [31]:
#4 date can just be dropped, querytime is much easier to work with

In [32]:
#5
dataset_v5['occupancy'] = dataset_v5['occupancy'].apply(lambda r: r.split('/')[-1] if pd.notnull(r) else None)
dataset_v5

Unnamed: 0,id,querytime,to,occupancy,from,train_series,train_direction,train_type
0,-1,2016-07-27 18:05:51,,high,http://irail.be/stations/NMBS/008813003,1500,away,IC
1,-1,2016-07-27 18:06:11,,high,http://irail.be/stations/NMBS/008813003,1500,away,IC
2,-1,2016-07-27 18:08:57,http://irail.be/stations/NMBS/00,high,http://irail.be/stations/NMBS/008813003,1500,away,IC
3,-1,2016-07-27 18:09:08,http://irail.be/stations/NMBS/00,high,http://irail.be/stations/NMBS/008813003,1500,away,IC
4,-1,2016-07-27 18:11:01,http://irail.be/stations/NMBS/00,high,http://irail.be/stations/NMBS/008813003,1500,away,IC
5,-1,2016-07-27 18:11:50,http://irail.be/stations/NMBS/00,high,http://irail.be/stations/NMBS/008813003,1500,away,IC
6,-1,2016-07-27 18:12:47,http://irail.be/stations/NMBS/00,high,http://irail.be/stations/NMBS/008813003,1500,away,IC
7,-1,2016-07-27 18:13:23,http://irail.be/stations/NMBS/00,high,http://irail.be/stations/NMBS/008813003,1500,away,IC
8,-1,2016-07-27 18:15:38,http://irail.be/stations/NMBS/00,medium,http://irail.be/stations/NMBS/008813003,1500,away,IC
9,-1,2016-07-27 18:16:51,http://irail.be/stations/NMBS/00,medium,http://irail.be/stations/NMBS/008813003,1500,away,IC


In [33]:
#6
dataset_v5['from'].describe()

count                                        5555
unique                                        391
top       http://irail.be/stations/NMBS/008892007
freq                                          492
Name: from, dtype: object

In [34]:
dataset_v5['to'].describe()

count                                        5543
unique                                        400
top       http://irail.be/stations/NMBS/008814001
freq                                          470
Name: to, dtype: object

In [35]:
mask = pd.isnull(dataset_v5['to'])
dataset_v5[mask]

Unnamed: 0,id,querytime,to,occupancy,from,train_series,train_direction,train_type
0,-1,2016-07-27 18:05:51,,high,http://irail.be/stations/NMBS/008813003,1500,away,IC
1,-1,2016-07-27 18:06:11,,high,http://irail.be/stations/NMBS/008813003,1500,away,IC
21,-1,2016-07-27 18:31:32,,medium,http://irail.be/stations/NMBS/008813003,1500,away,IC
22,-1,2016-07-27 18:31:55,,medium,http://irail.be/stations/NMBS/008813003,1500,away,IC
23,-1,2016-07-27 18:32:59,,medium,http://irail.be/stations/NMBS/008813003,1500,away,IC
24,-1,2016-07-27 18:35:51,,high,http://irail.be/stations/NMBS/008813003,1500,away,IC
25,-1,2016-07-27 18:35:58,,high,http://irail.be/stations/NMBS/008813003,1500,away,IC
26,-1,2016-07-27 18:36:02,,high,http://irail.be/stations/NMBS/008813003,1500,away,IC
27,-1,2016-07-27 18:36:05,,high,http://irail.be/stations/NMBS/008813003,1500,away,IC
28,-1,2016-07-27 18:36:15,,high,http://irail.be/stations/NMBS/008813003,1500,away,IC


In [36]:
dataset_v6 = dataset_v5.drop(dataset_v5[mask].index)

In [None]:
dataset_v6['from'].describe()


In [None]:
dataset_v6['from'] = dataset_v6['from'].apply(lambda r: r.strip().split('/')[-1])
dataset_v6['to'] = dataset_v6['to'].apply(lambda r: r.strip().split('/')[-1])


In [None]:
mask = dataset_v6['from'].str.len() != 9
print(dataset_v6[mask].shape)
dataset_v6[mask]


In [None]:
dataset_v7 = dataset_v6.drop(dataset_v6[mask].index)
dataset_v7.shape

In [None]:
dataset_v7.loc[32,'to'] = "00" +  dataset_v7.loc[32,'to']  

In [None]:
mask = dataset_v7['to'].str.len() != 9
print(dataset_v7[mask].shape)
dataset_v7[mask]

In [None]:
dataset_v8 = dataset_v7.drop(dataset_v7[mask].index)
dataset_v8.shape

In [None]:
dataset_v8['from'] = dataset_v8['from'].apply(lambda t: t[2:])
dataset_v8['to'] = dataset_v8['to'].apply(lambda t: t[2:])
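
The station clean-up steps can also be expressed with vectorized string methods. The sketch below runs on made-up URLs and assumes, based on the lengths checked above, that a valid station id is a 7-digit number optionally preceded by '00'; the variable names are illustrative only.

```python
import pandas as pd

# Made-up station URLs: a full id, an id without the leading '00'
# (with stray whitespace), and a truncated id
stations = pd.Series([
    'http://irail.be/stations/NMBS/008813003',
    ' http://irail.be/stations/NMBS/8814001 ',
    'http://irail.be/stations/NMBS/00',
])

# Keep only the part after the last slash
ids = stations.str.strip().str.rsplit('/', n=1).str[-1]

# Assumption: 7 digits, optionally prefixed with '00'
valid = ids.str.match(r'^(00)?\d{7}$')

# Normalize the valid ids to their last 7 digits
short_ids = ids[valid].str[-7:]
print(short_ids.tolist())
```

Validating before normalizing matters here: padding the truncated '00' entry with zeros first would make it look like a well-formed id.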


In [None]:
print(dataset_v8[dataset_v8['id'] == -1].shape)
print(dataset_v8[dataset_v8['id'] != -1].shape)

dataset_v8.to_csv('json_cleaned.csv', header=True, sep=',', index=False)
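
When the cleaned file is read back later, the dtypes are not restored automatically. A minimal round-trip sketch on a made-up frame, using an in-memory buffer instead of json_cleaned.csv:

```python
import io
import pandas as pd

# Made-up frame with the same kinds of columns as the cleaned dataset
df = pd.DataFrame({
    'id': [-1, 7],
    'querytime': pd.to_datetime(['2016-07-27 18:05:51', '2016-12-02 15:32:02']),
    'from': ['8813003', '8814001'],
})

# Write to an in-memory buffer and rewind it
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime column; dtype keeps station ids as strings
restored = pd.read_csv(buffer, parse_dates=['querytime'], dtype={'from': str})
print(restored.dtypes)
```

Keeping the station ids as strings is a design choice: they are identifiers, not quantities, so arithmetic on them is never meaningful.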