# Assembling data from all files

We have 5 cleaned files, each written to a new CSV file:

* BeijingPM_cleaned.csv
* ChengduPM_cleaned.csv
* GuangzhouPM_cleaned.csv
* ShanghaiPM_cleaned.csv
* ShenyangPM_cleaned.csv

I will merge these files into one large file and save it.

In [1]:
import os
import pandas as pd

In [2]:
def load_data_from_csv(data_path):
    return pd.read_csv(data_path)

In [5]:
MAIN_DIR_PATH = '../Prep_FiveCitiePMData'
# os.listdir returns names in arbitrary order; sort for a reproducible merge
cities_data_path_list = sorted(os.listdir(MAIN_DIR_PATH))

In [6]:
print(*cities_data_path_list, sep='\n')

BeijingPM_cleaned.csv
ChengduPM_cleaned.csv
GuangzhouPM_cleaned.csv
ShanghaiPM_cleaned.csv
ShenyangPM_cleaned.csv


## New feature addition: City

While looking at the data, I realized that we have an additional feature available: the name of the city.

Each of the 5 files holds the data for one city, so we can take the city name from each file name and use it as a `City` feature.

In [7]:
def append_all_data(data_paths):
    frames = []
    for d_path in data_paths:
        data_path = os.path.join(MAIN_DIR_PATH, d_path)
        # Strip the 14-character suffix 'PM_cleaned.csv' to get the city name
        city = d_path[:-14]
        print(city)
        data_frame = load_data_from_csv(data_path)
        data_frame['City'] = city
        frames.append(data_frame)
    # DataFrame.append was removed in pandas 2.0; concatenate once instead
    return pd.concat(frames, ignore_index=True)
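
The slice `d_path[:-14]` only works while every file name ends in exactly `PM_cleaned.csv`. As a minimal sketch of a stricter variant, the snippet below checks the suffix before stripping it. The names `FILE_SUFFIX`, `city_from_filename` and `combine_with_city` are hypothetical helpers (not part of the notebook), and the small frames are made-up stand-ins for the real files.

```python
import pandas as pd

# Assumed naming pattern: <City>PM_cleaned.csv
FILE_SUFFIX = "PM_cleaned.csv"


def city_from_filename(filename, suffix=FILE_SUFFIX):
    """Return the city part of a name such as 'BeijingPM_cleaned.csv'."""
    if not filename.endswith(suffix):
        raise ValueError(f"unexpected file name: {filename!r}")
    return filename[: -len(suffix)]


def combine_with_city(frames_by_filename):
    """Tag each frame with its city and stack all frames into one."""
    tagged = [
        frame.assign(City=city_from_filename(name))
        for name, frame in frames_by_filename.items()
    ]
    return pd.concat(tagged, ignore_index=True)


# Tiny made-up frames standing in for the real cleaned files.
demo_frames = {
    "BeijingPM_cleaned.csv": pd.DataFrame({"PM": [129.0, 148.0]}),
    "ChengduPM_cleaned.csv": pd.DataFrame({"PM": [80.0]}),
}
combined = combine_with_city(demo_frames)
print(combined)
```

A file that does not follow the pattern raises an error here instead of silently producing a wrong city name.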

**Get all files**

In [8]:
data = append_all_data(cities_data_path_list)

Beijing
Chengdu
Guangzhou
Shanghai
Shenyang
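
After merging, it can be worth checking that no rows were lost and that each row carries the right city. The sketch below shows one possible check on made-up data: it compares the per-city row counts of the merged frame with the lengths of the individual frames. The variable names and the tiny frames are assumptions for illustration only.

```python
import pandas as pd

# Made-up per-city frames standing in for the loaded files.
per_city = {
    "Beijing": pd.DataFrame({"PM": [129.0, 148.0]}),
    "Chengdu": pd.DataFrame({"PM": [80.0]}),
}

# Merge them the same way as in the notebook: add a City column, then stack.
merged = pd.concat(
    [frame.assign(City=city) for city, frame in per_city.items()],
    ignore_index=True,
)

# Row counts per city before and after the merge should agree.
expected_rows = {city: len(frame) for city, frame in per_city.items()}
actual_rows = merged["City"].value_counts().to_dict()
print(actual_rows == expected_rows)
```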


In [10]:
data.head(10)

,Unnamed: 0,No,year,month,day,hour,season,DEWP,HUMI,PRES,TEMP,cbwd,Iws,precipitation,Iprec,PM,City
0,0,1,2010,1,1,23,4.0,-17.0,41.0,1020.0,-5.0,cv,0.89,0.0,0.0,129.0,Beijing
1,1,2,2010,1,2,0,4.0,-16.0,38.0,1020.0,-4.0,SE,1.79,0.0,0.0,148.0,Beijing
2,2,3,2010,1,2,1,4.0,-15.0,42.0,1020.0,-4.0,SE,2.68,0.0,0.0,159.0,Beijing
3,3,4,2010,1,2,2,4.0,-11.0,63.5,1021.0,-5.0,SE,3.57,0.0,0.0,181.0,Beijing
4,4,5,2010,1,2,3,4.0,-7.0,85.0,1022.0,-5.0,SE,5.36,0.0,0.0,138.0,Beijing
5,5,6,2010,1,2,4,4.0,-7.0,85.0,1022.0,-5.0,SE,6.25,0.0,0.0,109.0,Beijing
6,6,7,2010,1,2,5,4.0,-7.0,92.0,1022.0,-6.0,SE,7.14,0.0,0.0,105.0,Beijing
7,7,8,2010,1,2,6,4.0,-7.0,92.0,1023.0,-6.0,SE,8.93,0.0,0.0,124.0,Beijing
8,8,9,2010,1,2,7,4.0,-7.0,85.0,1024.0,-5.0,SE,10.72,0.0,0.0,120.0,Beijing
9,9,10,2010,1,2,8,4.0,-8.0,85.0,1024.0,-6.0,SE,12.51,0.0,0.0,132.0,Beijing


**We don't need the `Unnamed: 0` and `No` columns.** Both are just row counters, so we will drop them.

In [11]:
data.columns

Index(['Unnamed: 0', 'No', 'year', 'month', 'day', 'hour', 'season', 'DEWP',
       'HUMI', 'PRES', 'TEMP', 'cbwd', 'Iws', 'precipitation', 'Iprec', 'PM',
       'City'],
      dtype='object')

In [12]:
data = data.drop(['Unnamed: 0', 'No'], axis=1)

In [13]:
data.columns

Index(['year', 'month', 'day', 'hour', 'season', 'DEWP', 'HUMI', 'PRES',
       'TEMP', 'cbwd', 'Iws', 'precipitation', 'Iprec', 'PM', 'City'],
      dtype='object')
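
A column named `Unnamed: 0` usually appears when a DataFrame is saved with its index and then read back. I am assuming that is what happened in the cleaning step; the notebook does not show it. A small in-memory sketch of the effect, using a made-up two-row frame:

```python
import io

import pandas as pd

frame = pd.DataFrame({"No": [1, 2], "PM": [129.0, 148.0]})

# By default to_csv also writes the index, as a first column with an empty header.
with_index = frame.to_csv()
reloaded = pd.read_csv(io.StringIO(with_index))
print(list(reloaded.columns))

# Writing without the index avoids the extra column on reload.
without_index = frame.to_csv(index=False)
clean = pd.read_csv(io.StringIO(without_index))
print(list(clean.columns))
```

The first print shows the extra `Unnamed: 0` column, the second shows only `No` and `PM`.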

In [14]:
data.head()

,year,month,day,hour,season,DEWP,HUMI,PRES,TEMP,cbwd,Iws,precipitation,Iprec,PM,City
0,2010,1,1,23,4.0,-17.0,41.0,1020.0,-5.0,cv,0.89,0.0,0.0,129.0,Beijing
1,2010,1,2,0,4.0,-16.0,38.0,1020.0,-4.0,SE,1.79,0.0,0.0,148.0,Beijing
2,2010,1,2,1,4.0,-15.0,42.0,1020.0,-4.0,SE,2.68,0.0,0.0,159.0,Beijing
3,2010,1,2,2,4.0,-11.0,63.5,1021.0,-5.0,SE,3.57,0.0,0.0,181.0,Beijing
4,2010,1,2,3,4.0,-7.0,85.0,1022.0,-5.0,SE,5.36,0.0,0.0,138.0,Beijing


In [18]:
DATASET_DIR = '../PREP_DATASET'

In [19]:
file_path = os.path.join(DATASET_DIR, 'PREP_PM_DATASET.csv')
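# index=False keeps the row index out of the file, so no 'Unnamed: 0' column comes back on reload
data.to_csv(file_path, index=False)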
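
`to_csv` does not create missing folders, so the save fails if the target directory does not exist yet. Here is a minimal sketch of a safer save, assuming the index should not be stored in the file. It writes into a temporary directory so it can run anywhere; the one-row frame is a stand-in for the merged data.

```python
import os
import tempfile

import pandas as pd

# Stand-in for the merged data.
data = pd.DataFrame({"PM": [129.0], "City": ["Beijing"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    dataset_dir = os.path.join(tmp_dir, "PREP_DATASET")
    # Create the output folder if it is missing; do nothing if it already exists.
    os.makedirs(dataset_dir, exist_ok=True)

    file_path = os.path.join(dataset_dir, "PREP_PM_DATASET.csv")
    data.to_csv(file_path, index=False)

    # Read the file back to confirm the columns survived unchanged.
    reloaded = pd.read_csv(file_path)

print(list(reloaded.columns))
```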

## DONE