# Predicting PM2.5 Levels in Beijing's Air

## I. Objective
**Use the previous 24 hours of Beijing weather data to predict the PM2.5 values in the following 12 hours of data.**

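## II. Background

**In Beijing, the biggest headache of winter is smog. Whenever it rolls in, the gray air and the feeling of struggling to breathe behind a thick mask leave people in low spirits. The main culprit behind the smog is PM2.5.**

**This analysis uses a linear regression model to predict PM2.5 values.**

**Yesterday, a LinearRegression model was trained on 24 hours of historical Beijing data, from 22:00 on 08/07 to 21:00 on 08/08. Today we collect the newly generated data as the test set. Since 12 hours have passed, 12 records are available.**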

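## III. Data Source

**The data comes from Seniverse (心知天气). The site serves weather and air-quality data as JSON through RESTful URLs, so retrieval is straightforward.**

## IV. Data Analysis and Prediction

### 1. Data Wrangling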

In [61]:
import pandas as pd
from io import StringIO
from urllib import request
import json
from dateutil.parser import parse

**Fetch the data over the network and tidy it up**

In [62]:
url_beijing_w = 'https://api.seniverse.com/v3/weather/hourly_history.json?key=Sz6GmmiQ6SAjYTKbc&location=beijing&language=zh-Hans&unit=c'
url_beijing_p = 'https://api.seniverse.com/v3/air/hourly_history.json?key=Sz6GmmiQ6SAjYTKbc&location=beijing&language=zh-Hans&scope=city'
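The two URLs above share a base address and differ only in the endpoint path and query parameters. A minimal sketch of building them from parts is shown below. The helper `build_url` and the environment variable name `SENIVERSE_KEY` are assumptions made for illustration; only the endpoint paths and parameter names come from the URLs above.

```python
import os
from urllib.parse import urlencode

BASE = 'https://api.seniverse.com/v3'


def build_url(path, key, **params):
    """Join the base URL, an endpoint path, and URL-encoded query parameters."""
    query = urlencode({'key': key, **params})
    return f'{BASE}/{path}?{query}'


# SENIVERSE_KEY is a hypothetical environment variable name; 'YOUR_KEY' is a placeholder.
api_key = os.environ.get('SENIVERSE_KEY', 'YOUR_KEY')

url_weather = build_url('weather/hourly_history.json', api_key,
                        location='beijing', language='zh-Hans', unit='c')
url_air = build_url('air/hourly_history.json', api_key,
                    location='beijing', language='zh-Hans', scope='city')
```

Reading the key from the environment keeps it out of the notebook when the file is shared.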

In [63]:
s_p = request.urlopen(url_beijing_p).read().decode('utf8')
s_w = request.urlopen(url_beijing_w).read().decode('utf8')
data_dict_p = json.loads(s_p)
data_dict_w = json.loads(s_w)
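The download above assumes the request succeeds and the response has the expected shape. The sketch below adds a timeout and a clearer error when the payload lacks `results[0]['hourly_history']`, the path the later cells index into. The function names and the sample payload are made up for illustration.

```python
import json
from urllib import request


def fetch_json(url, timeout=10):
    """Download a URL and decode the body as JSON (needs network access)."""
    with request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode('utf8'))


def hourly_records(payload):
    """Return the list of hourly records, or raise if the payload lacks them."""
    try:
        return payload['results'][0]['hourly_history']
    except (KeyError, IndexError, TypeError) as exc:
        raise ValueError(f'unexpected response: {payload!r}') from exc


# Made-up payload that mirrors the structure indexed later in the notebook.
sample = json.loads('{"results": [{"hourly_history": [{"city": {"pm25": "57"}}]}]}')
records = hourly_records(sample)
```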

In [64]:
def gen_table_p(list2):
    data_dict = {}
    for i, value in enumerate(list2):
        data_dict[i] = value['city']
    return data_dict

In [65]:
def gen_table_w(list2):
    data_dict = {}
    for i, value in enumerate(list2):
        data_dict[i] = value
    return data_dict

**Convert the weather and air-pollutant data into DataFrames**

In [66]:
data_list_p = data_dict_p['results'][0]['hourly_history']
data_list_p = gen_table_p(data_list_p)
data_p = pd.DataFrame(data_list_p)

In [67]:
data_list_w = data_dict_w['results'][0]['hourly_history']
data_list_w = gen_table_w(data_list_w)
data_w = pd.DataFrame(data_list_w)

In [68]:
data_p = data_p.T
data_w = data_w.T
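The cells above build a dict keyed by row number, wrap it in a DataFrame, and transpose it. As a possible shortcut, pandas also accepts a list of dicts and creates one row per dict. The records below are made up; only the `'city'` key mirrors the structure used in `gen_table_p`.

```python
import pandas as pd

# Made-up records: each item carries a 'city' dict, as gen_table_p assumes.
records = [
    {'city': {'aqi': '78', 'pm25': '57', 'last_update': '2019-08-09T09:00:00+08:00'}},
    {'city': {'aqi': '80', 'pm25': '59', 'last_update': '2019-08-09T08:00:00+08:00'}},
]

# Notebook approach: dict keyed by row number, then transpose.
by_index = {i: rec['city'] for i, rec in enumerate(records)}
table_transposed = pd.DataFrame(by_index).T

# Direct approach: a list of dicts becomes one row per dict.
table_direct = pd.DataFrame([rec['city'] for rec in records])
```

Both tables hold the same cells. The direct form skips the transpose and lets pandas infer a dtype per column instead of leaving every column as `object`.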

**Adjust the time format and drop the features that are not needed**

In [69]:
def adjust_time(data):
    time = data['last_update'].astype(str)
    time = time.str[:19]
    time = time.str.replace('T', ' ')
    time = time.map(lambda x : parse(x))
    time = time.dt.strftime('%H-%m/%d')
    data['last_update'] = time
    return data
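`adjust_time` turns each timestamp into an hour-month/day string that later serves as the merge key. A vectorized sketch of the same idea with `pd.to_datetime` follows. The sample timestamps are made up and assume the ISO 8601 layout that the `str[:19]` slice implies.

```python
import pandas as pd


def adjust_time_vectorized(data, column='last_update'):
    """Return a copy whose timestamp column is reformatted as hour-month/day keys."""
    out = data.copy()
    local = out[column].astype(str).str[:19]   # keep 'YYYY-MM-DDTHH:MM:SS', drop the offset
    out[column] = pd.to_datetime(local).dt.strftime('%H-%m/%d')
    return out


# Made-up timestamps for illustration.
frame = pd.DataFrame({'last_update': ['2019-08-09T09:00:00+08:00',
                                      '2019-08-08T22:00:00+08:00']})
keys = adjust_time_vectorized(frame)['last_update'].tolist()
```

Unlike `adjust_time`, this version leaves the input frame unmodified. The key carries no year, so it is only unique within a window shorter than a year.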

In [70]:
data_p = adjust_time(data_p)
data_w = adjust_time(data_w)

In [71]:
data = pd.merge(data_p, data_w, on = 'last_update')
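The merge above is an inner join on the reformatted timestamp, so an hour missing from either table disappears without warning and a repeated key multiplies rows. A small sketch of guarding against both, using the `validate` argument of `pd.merge`, is given below. The key strings are made up; the numeric values are the first two rows of the table shown further down.

```python
import pandas as pd

air = pd.DataFrame({'last_update': ['09-08/09', '08-08/09'], 'pm25': [57, 59]})
weather = pd.DataFrame({'last_update': ['09-08/09', '08-08/09'], 'humidity': [77, 79]})

# validate='one_to_one' raises pandas.errors.MergeError if a key repeats on either side.
merged = pd.merge(air, weather, on='last_update', how='inner', validate='one_to_one')

# An inner join drops hours present in only one table, so compare row counts.
dropped = len(air) - len(merged)
```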

In [72]:
data_test = data[:12]

In [74]:
data_test = data_test.drop(['dew_point', 'wind_direction', 'wind_direction_degree', 'text', 'code', 'wind_scale','last_update', 'quality'], axis = 1)

In [75]:
data_test

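| | aqi | co | no2 | o3 | pm10 | pm25 | so2 | clouds | feels_like | humidity | pressure | temperature | visibility | wind_speed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 78 | 0.817 | 29 | 61 | 61 | 57 | 1 | 50 | 28 | 77 | 1001 | 28 | 3.9 | 11.16 |
| 1 | 80 | 0.85 | 29 | 49 | 59 | 59 | 1 | 50 | 27 | 79 | 1001 | 27 | 3.7 | 8.64 |
| 2 | 83 | 0.842 | 30 | 39 | 62 | 61 | 2 | 50 | 26 | 83 | 1001 | 27 | 3.1 | 8.64 |
| 3 | 84 | 0.817 | 28 | 44 | 61 | 62 | 2 | 50 | 25 | 85 | 1001 | 26 | 3.1 | 6.12 |
| 4 | 80 | 0.792 | 29 | 49 | 62 | 59 | 2 | 50 | 25 | 86 | 1001 | 25 | 3.1 | 9.0 |
| 5 | 80 | 0.775 | 28 | 54 | 65 | 59 | 2 | 50 | 26 | 82 | 1001 | 26 | 3.5 | 9.0 |
| 6 | 75 | 0.767 | 27 | 61 | 64 | 55 | 2 | 50 | 26 | 79 | 1001 | 27 | 4.6 | 9.72 |
| 7 | 69 | 0.808 | 25 | 73 | 65 | 50 | 2 | 50 | 27 | 79 | 1002 | 27 | 4.3 | 9.36 |
| 8 | 67 | 0.817 | 25 | 81 | 69 | 48 | 2 | 50 | 27 | 79 | 1002 | 27 | 4.1 | 9.0 |
| 9 | 62 | 0.767 | 26 | 88 | 67 | 44 | 2 | 50 | 27 | 79 | 1002 | 27 | 5.1 | 7.56 |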


### 2. Building the Training and Test Sets
**Back up the data, then build the training and test sets**

In [76]:
data_test.to_excel('D:/python/practise/sample/weather/data_test(22-09).xlsx')

In [85]:
data_train = pd.read_excel('D:/python/practise/sample/weather/data_all(22-21).xlsx')
data_test = pd.read_excel('D:/python/practise/sample/weather/data_test(22-09).xlsx')

In [87]:
data_train = data_train.drop(['Unnamed: 0', 'last_update', 'station'], axis = 1)
data_test = data_test.drop('Unnamed: 0', axis = 1)
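The `'Unnamed: 0'` column dropped above appears because the row index is written to the file and read back as an ordinary column. The sketch below reproduces the effect with an in-memory CSV as a stand-in for the Excel files, so it runs without any files on disk. `to_excel` accepts the same `index=False` argument. The helper `round_trip` is made up for illustration.

```python
import io
import pandas as pd

frame = pd.DataFrame({'pm25': [57, 59], 'humidity': [77, 79]})


def round_trip(df, **to_csv_kwargs):
    """Write df to an in-memory CSV and read it back."""
    buffer = io.StringIO()
    df.to_csv(buffer, **to_csv_kwargs)
    buffer.seek(0)
    return pd.read_csv(buffer)


with_index = round_trip(frame)                   # index is written as an unnamed column
without_index = round_trip(frame, index=False)   # index is omitted
```

Writing with `index=False` would make the later `drop('Unnamed: 0', ...)` calls unnecessary.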

In [88]:
y_train = data_train['pm25'].values
x_train = data_train.drop('pm25', axis = 1).values

**Extract PM2.5 as the ground-truth values**

In [89]:
y_true = data_test['pm25'].values
x_test = data_test.drop('pm25', axis = 1).values
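Calling `.values` discards the column names, so the model matches features purely by position. If the training and test files store their columns in different orders, the fit still runs but pairs the wrong features. A possible safeguard, sketched on toy frames, is to reorder the test columns to the training layout first.

```python
import pandas as pd

train = pd.DataFrame({'aqi': [78, 80], 'pm25': [57, 59], 'humidity': [77, 79]})
# Same columns as train, deliberately in a different order.
test = pd.DataFrame({'humidity': [83], 'pm25': [61], 'aqi': [83]})

missing = set(train.columns) - set(test.columns)
if missing:
    raise ValueError(f'test set lacks columns: {sorted(missing)}')

test_aligned = test[train.columns]   # reorder to match the training layout

x_train = train.drop('pm25', axis=1).values
x_test = test_aligned.drop('pm25', axis=1).values
```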

**Start with an ordinary linear regression model**

In [90]:
from sklearn.linear_model import LinearRegression

In [91]:
model = LinearRegression()

In [92]:
model.fit(x_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [93]:
y_esti = model.predict(x_test)

**Predicted values:**

In [94]:
y_esti

array([42.58293092, 44.55257971, 43.7785975 , 42.5218313 , 39.48083543,
       41.79761404, 39.50294654, 41.746076  , 42.1306705 , 39.09266014,
       34.90946585, 32.29064508])

**Actual values:**

In [95]:
y_true

array([57, 59, 61, 62, 59, 59, 55, 50, 48, 44, 42, 39], dtype=int64)

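**Evaluate the predictions with the mean squared error, where $y_i$ is the actual value and $\hat{y}_i$ the predicted value:**$$score = \frac{1}{n}\sum_{i=1}^{n}(y_i-\hat{y}_i)^2$$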

In [97]:
((y_esti - y_true)**2).mean()

185.9652934007928
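The score can be unpacked a little further. The sketch below recomputes it from the `y_esti` and `y_true` outputs printed above, and adds the root mean squared error and the mean signed error, which shows the direction of the miss.

```python
import numpy as np

# Values copied from the y_esti and y_true outputs above.
y_esti = np.array([42.58293092, 44.55257971, 43.7785975, 42.5218313, 39.48083543,
                   41.79761404, 39.50294654, 41.746076, 42.1306705, 39.09266014,
                   34.90946585, 32.29064508])
y_true = np.array([57, 59, 61, 62, 59, 59, 55, 50, 48, 44, 42, 39])

residuals = y_true - y_esti
mse = np.mean(residuals ** 2)      # mean squared error
rmse = np.sqrt(mse)                # back in the units of PM2.5
bias = np.mean(y_esti - y_true)    # negative means under-prediction on average

print(f'MSE={mse:.2f}  RMSE={rmse:.2f}  mean error={bias:.2f}')
```

Every residual is positive here, so the model under-predicts at all 12 hours rather than scattering around the actual values.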

**Next, try ridge regression**

In [121]:
from sklearn.linear_model import Ridge
from sklearn.linear_model import RidgeCV

In [116]:
reg = Ridge(alpha = 0.01)

In [117]:
reg.fit(x_train, y_train)

Ridge(alpha=0.01, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)

In [118]:
y_esti_reg = reg.predict(x_test)

In [119]:
((y_esti_reg - y_true)**2).mean()

186.24165721215715

In [122]:
reg_cv = RidgeCV(alphas=[0.1, 1.0, 10.0])

In [123]:
reg_cv.fit(x_train, y_train)

RidgeCV(alphas=array([ 0.1,  1. , 10. ]), cv=None, fit_intercept=True,
    gcv_mode=None, normalize=False, scoring=None, store_cv_values=False)

In [124]:
y_esti_regcv = reg_cv.predict(x_test)

In [125]:
((y_esti_regcv - y_true)**2).mean()

188.67034918027025
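Ridge regression adds a penalty `alpha * ||w||^2` to the least-squares objective, which is why a small `alpha` such as 0.01 gives a score close to ordinary linear regression. The sketch below implements the textbook closed-form solution in NumPy on synthetic data. It illustrates the objective only and is not a claim about which solver scikit-learn uses internally.

```python
import numpy as np


def ridge_closed_form(X, y, alpha):
    """Solve min ||y - Xw - b||^2 + alpha * ||w||^2 with an unpenalized intercept b."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean                 # centering absorbs the intercept
    gram = Xc.T @ Xc + alpha * np.eye(X.shape[1])
    w = np.linalg.solve(gram, Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b


rng = np.random.default_rng(0)
X = rng.normal(size=(24, 3))                        # synthetic stand-in data
y = X @ np.array([2.0, -1.0, 0.5]) + 3.0 + rng.normal(scale=0.1, size=24)

w_ols, b_ols = ridge_closed_form(X, y, alpha=0.0)       # alpha=0 reduces to least squares
w_ridge, b_ridge = ridge_closed_form(X, y, alpha=10.0)  # larger alpha shrinks the weights
```

Because the penalty acts on the raw coefficients, its effect depends on the scale of each feature. With columns as different as `pressure` and `co`, standardizing the features before fitting is a common precaution.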

**Then try Lasso regression**

In [126]:
from sklearn.linear_model import Lasso

In [139]:
lasso = Lasso(alpha = 0.01)

In [140]:
lasso.fit(x_train, y_train)

Lasso(alpha=0.01, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False)

In [141]:
y_esti_lasso = lasso.predict(x_test)

In [142]:
((y_esti_lasso - y_true)**2).mean()

186.51258366387617
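The four fits above repeat the same fit, predict, and score steps. One way to sketch the comparison as a single loop is shown below. The data is synthetic, so the printed scores will not match the notebook's; the model settings are the ones used above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, RidgeCV, Lasso
from sklearn.metrics import mean_squared_error

# Synthetic stand-in for the 24-row training set and 12-row test set.
rng = np.random.default_rng(0)
coef = np.array([1.5, -2.0, 0.0, 0.7])
x_train = rng.normal(size=(24, 4))
y_train = x_train @ coef + rng.normal(scale=0.5, size=24)
x_test = rng.normal(size=(12, 4))
y_true = x_test @ coef + rng.normal(scale=0.5, size=12)

models = {
    'linear': LinearRegression(),
    'ridge': Ridge(alpha=0.01),
    'ridge_cv': RidgeCV(alphas=[0.1, 1.0, 10.0]),
    'lasso': Lasso(alpha=0.01),
}

scores = {}
for name, model in models.items():
    model.fit(x_train, y_train)
    scores[name] = mean_squared_error(y_true, model.predict(x_test))

# Print the models from best (lowest MSE) to worst.
for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f'{name:>9}: {score:.3f}')

chosen_alpha = models['ridge_cv'].alpha_   # the alpha that RidgeCV selected
```

With only 12 test rows, differences in the first decimal place of the score are small enough to arise from noise, so a ranking like this is indicative at best.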

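## V. Conclusion
**The predictions are biased: all 12 fall below the actual values (MSE about 186, an RMSE of about 13.6), but they broadly follow the trend. Among the models tried, ordinary linear regression performs slightly best (MSE 185.97, versus 186.24 for Ridge, 186.51 for Lasso, and 188.67 for RidgeCV).**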