**Sultan Arapov, Data Analyst**

**Task:**
Next year I am going to spend my vacation in China. Which city should I visit, and in which month? What will the PM2.5 level be in that period?

In [1]:
import pandas as pd
import numpy as np

In [11]:
# Hourly PM2.5 and weather records, 2010-01-01 to 2015-12-31, one CSV file per city.
beijing = pd.read_csv('BeijingPM20100101_20151231.csv')
chengdu = pd.read_csv('ChengduPM20100101_20151231.csv')
guanzhou = pd.read_csv('GuangzhouPM20100101_20151231.csv')
shanghai = pd.read_csv('ShanghaiPM20100101_20151231.csv')
shenyang = pd.read_csv('ShenyangPM20100101_20151231.csv') 

beijing.head()                                                                                                         

,No,year,month,day,hour,season,PM_Dongsi,PM_Dongsihuan,PM_Nongzhanguan,PM_US Post,DEWP,HUMI,PRES,TEMP,cbwd,Iws,precipitation,Iprec
0,1,2010,1,1,0,4,,,,,-21.0,43.0,1021.0,-11.0,NW,1.79,0.0,0.0
1,2,2010,1,1,1,4,,,,,-21.0,47.0,1020.0,-12.0,NW,4.92,0.0,0.0
2,3,2010,1,1,2,4,,,,,-21.0,43.0,1019.0,-11.0,NW,6.71,0.0,0.0
3,4,2010,1,1,3,4,,,,,-21.0,55.0,1019.0,-14.0,NW,9.84,0.0,0.0
4,5,2010,1,1,4,4,,,,,-20.0,51.0,1018.0,-12.0,NW,12.97,0.0,0.0


**Initial assumptions**

The task statement does not say what temperature the traveller prefers, and some people may well prefer winter to summer. We therefore treat mean temperature as a neutral factor, unless it is either quite cold or quite hot.
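
The assumption above could be applied as a simple filter on the monthly mean temperature. The sketch below is only an illustration: the three `TEMP` values are copied (rounded) from the monthly tables shown later in this notebook, while `COLD_LIMIT` and `HOT_LIMIT` are hypothetical thresholds that the analysis itself does not define.

```python
import pandas as pd

# Monthly mean temperatures (deg C), rounded from outputs later in this notebook.
temps = pd.Series(
    [-4.02, 21.33, 28.45],
    index=pd.MultiIndex.from_tuples(
        [("Beijing", 1), ("Beijing", 5), ("Guanzhou", 7)],
        names=["City", "month"],
    ),
    name="TEMP",
)

# Hypothetical comfort band; these limits are assumptions, not part of the analysis.
COLD_LIMIT = 0.0
HOT_LIMIT = 30.0

# Keep only the city/month pairs whose mean temperature lies inside the band.
acceptable = temps.between(COLD_LIMIT, HOT_LIMIT)
kept = temps[acceptable]
print(kept)
```

With these assumed limits, only the Beijing January row is excluded; different limits would of course give a different selection.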


**Logs**
1. No significant outliers were observed. 

2. Missing values were imputed by interpolation (pandas `interpolate`, which is linear by default).

3. Humidity was assumed to be positively related to precipitation, so of the two only humidity, i.e. 'HUMI', was kept.

4. The best climate was defined as the minimum sum of four variables, namely humidity, pressure, cumulated wind speed and PM2.5 concentration.

5. Of all the city and month combinations, July in Guangzhou was determined to be the best time to visit. The mean PM2.5 concentration was around 26.8, which was in fact the lowest value among all instances.
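
To make log item 2 concrete, here is a minimal sketch of how default linear interpolation behaves on a toy frame. The station names and readings are hypothetical and not taken from the dataset. It also shows one thing worth checking: a leading gap is not filled by the default forward interpolation, so adding station columns and dividing by their count (as the cells below do) yields NaN for such rows, whereas a row-wise `mean` skips the missing station. The row-wise mean is shown only as a possible alternative, not as the method used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly readings from two stations (not taken from the dataset).
toy = pd.DataFrame(
    {
        "station_a": [np.nan, 10.0, np.nan, 30.0],
        "station_b": [20.0, np.nan, 40.0, 50.0],
    }
)

# Linear interpolation, forward direction by default:
# interior gaps are filled, the leading NaN in station_a is left as NaN.
filled = toy.interpolate()

# Adding the columns and dividing keeps NaN wherever any station is missing.
avg_sum = (filled["station_a"] + filled["station_b"]) / 2

# A row-wise mean ignores missing stations instead.
avg_mean = filled[["station_a", "station_b"]].mean(axis=1)

print(pd.DataFrame({"avg_sum": avg_sum, "avg_mean": avg_mean}))
```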

In [12]:
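# Fill missing values by linear interpolation (the pandas default).
beijing.interpolate(inplace=True)
# City-level PM2.5: average of the four Beijing monitoring stations.
beijing['PM2.5'] = (beijing['PM_Dongsi'] + beijing['PM_Dongsihuan']
                    + beijing['PM_Nongzhanguan'] + beijing['PM_US Post']) / 4
# Keep only month, HUMI, PRES, TEMP, Iws and PM2.5.
beijing.drop(['No', 'cbwd', 'Iprec',
              'precipitation', 'DEWP', 'year', 'day', 'hour', 'season',
              'PM_Dongsi', 'PM_Dongsihuan', 'PM_Nongzhanguan', 'PM_US Post'],
             axis=1, inplace=True)
# Mean of each remaining variable per calendar month, pooled over all years.
bei = beijing.groupby(['month']).mean()
# bei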

# Chengdu: same steps, averaging its three stations.
chengdu.interpolate(inplace=True)
chengdu['PM2.5'] = (chengdu['PM_Caotangsi']+chengdu['PM_Shahepu']+chengdu['PM_US Post'])/3
chengdu.drop(['No', 'cbwd', 'Iprec',
          'precipitation', 'DEWP', 'year', 'day', 'hour', 'season',
              'PM_Caotangsi', 'PM_Shahepu', 'PM_US Post'], 
         axis=1, inplace=True)
cheng = chengdu.groupby(['month']).mean()
# cheng

# Guangzhou (variable spelled `guanzhou`): same steps, averaging its three stations.
guanzhou.interpolate(inplace=True)
guanzhou['PM2.5'] = (guanzhou['PM_City Station']+guanzhou['PM_5th Middle School']+guanzhou['PM_US Post'])/3
guanzhou.drop(['No', 'cbwd', 'Iprec',
          'precipitation', 'DEWP', 'year', 'day', 'hour', 'season',
              'PM_City Station', 'PM_5th Middle School', 'PM_US Post'], 
         axis=1, inplace=True)
guan = guanzhou.groupby(['month']).mean()
# guan

# Shanghai: same steps, averaging its three stations.
shanghai.interpolate(inplace=True)
shanghai['PM2.5'] = (shanghai['PM_Jingan']+shanghai['PM_US Post']+shanghai['PM_Xuhui'])/3
shanghai.drop(['No', 'cbwd', 'Iprec',
          'precipitation', 'DEWP', 'year', 'day', 'hour', 'season',
              'PM_Jingan', 'PM_US Post', 'PM_Xuhui'], 
         axis=1, inplace=True)
shan = shanghai.groupby(['month']).mean()
# shan

# Shenyang: same steps, averaging its three stations.
shenyang.interpolate(inplace=True)
shenyang['PM2.5'] = (shenyang['PM_Taiyuanjie']+shenyang['PM_US Post']+shenyang['PM_Xiaoheyan'])/3
shenyang.drop(['No', 'cbwd', 'Iprec',
          'precipitation', 'DEWP', 'year', 'day', 'hour', 'season',
              'PM_Taiyuanjie', 'PM_US Post', 'PM_Xiaoheyan'], 
         axis=1, inplace=True)
shen = shenyang.groupby(['month']).mean()
# shen

In [13]:
# Stack the five monthly tables into one frame indexed by (City, month).
table = pd.concat([bei, cheng, guan, shen, shan], axis=0,
                  keys=['Beijing', 'Chengdu', 'Guanzhou', 'Shenyang', 'Shanghai'],
                  names=['City', 'month'])

# import sklearn
# from sklearn import preprocessing
# table = pd.DataFrame(preprocessing.normalize(table, norm='l2'), index=table.index, columns = table.columns)
# Combined score: the raw (unscaled) sum of the four variables; lower is treated as better.
table['low'] = table['HUMI'] + table['PRES'] + table['Iws'] + table['PM2.5']
table.head()

City,month,HUMI,PRES,TEMP,Iws,PM2.5,low
Beijing,1,45.250112,1028.686828,-4.015457,32.212283,108.649866,1214.799088
Beijing,2,46.699088,1025.627342,-1.153107,19.961603,131.229714,1223.517747
Beijing,3,39.485999,1020.772625,6.446685,27.052825,106.685979,1193.997429
Beijing,4,41.914931,1014.363426,14.253009,31.613183,78.750405,1166.641944
Beijing,5,45.87948,1008.100134,21.334005,26.772328,72.517529,1153.269471
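
In the rows shown above, PRES is above 1000 while HUMI, Iws and PM2.5 all stay below 135, so the unscaled sum `low` is driven mostly by pressure. The sketch below is not part of the original analysis: it uses three rows copied (rounded) from this notebook's outputs and shows how min-max scaling each column to [0, 1] before summing would give the four variables equal weight. Min-max scaling is our assumption here, one of several possible choices, and `low_scaled` is a hypothetical column name.

```python
import pandas as pd

# Three rows copied (rounded) from the outputs in this notebook; not the full table.
toy = pd.DataFrame(
    {
        "HUMI": [45.25, 45.88, 79.02],
        "PRES": [1028.69, 1008.10, 997.89],
        "Iws": [32.21, 26.77, 7.12],
        "PM2.5": [108.65, 72.52, 26.85],
    },
    index=pd.MultiIndex.from_tuples(
        [("Beijing", 1), ("Beijing", 5), ("Guanzhou", 7)],
        names=["City", "month"],
    ),
)

cols = ["HUMI", "PRES", "Iws", "PM2.5"]

# Fraction of the raw sum that comes from pressure alone.
pres_share = toy["PRES"] / toy[cols].sum(axis=1)

# Min-max scale each column to [0, 1] so that no single variable dominates.
scaled = (toy[cols] - toy[cols].min()) / (toy[cols].max() - toy[cols].min())
toy["low_scaled"] = scaled.sum(axis=1)

# Label of the row with the smallest scaled score.
best = toy["low_scaled"].idxmin()
print(pres_share.round(2))
print(best)
```

This is a three-row illustration only; whether the ranking over the full table changes under scaling would need the complete data to be re-run.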


In [20]:
def city_month():
    """Return the row(s) of `table` with the smallest 'low' score."""
    a = table['low'].min()
    return table[table['low'] == a]

print('It seems to be most appropriate to visit Guanzhou in July when PM2.5 level is {}'.format(city_month()['PM2.5'].values))
city_month()

It seems to be most appropriate to visit Guanzhou in July when PM2.5 level is [ 26.84664766]


City,month,HUMI,PRES,TEMP,Iws,PM2.5,low
Guanzhou,7,79.015681,997.889247,28.447043,7.117966,26.846648,1110.869542
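
The message printed above hard-codes the city and month and shows PM2.5 as a one-element array. As a hedged alternative, the lookup could read the winning label directly from the index. The sketch below uses a two-row toy table with values copied from this notebook's outputs; `best_city_month` is a hypothetical helper name, not a function from the notebook.

```python
import pandas as pd

# Two rows copied from this notebook's outputs; not the full table.
toy = pd.DataFrame(
    {"PM2.5": [108.649866, 26.846648], "low": [1214.799088, 1110.869542]},
    index=pd.MultiIndex.from_tuples(
        [("Beijing", 1), ("Guanzhou", 7)], names=["City", "month"]
    ),
)


def best_city_month(table):
    """Return (city, month, PM2.5) for the row with the smallest 'low' score."""
    city, month = table["low"].idxmin()  # label of the minimum, as a tuple
    return city, month, float(table.loc[(city, month), "PM2.5"])


city, month, pm = best_city_month(toy)
print(f"Lowest combined score: {city}, month {month}, mean PM2.5 {pm:.1f}")
```

Note that `idxmin` returns only the first minimum, whereas the boolean filter in `city_month` returns every tied row.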
