# Analysis-6 Design Your Own Question

Now that you’ve found the answers to the questions above, design two of your own questions to answer. These should be sufficiently difficult, and you should be creative! You should start with a question, and then propose a predicted answer or hypothesis before writing a MapReduce job to answer it. If you come up with a particularly challenging question, it can count for two (ask first).

# Predicting Upcoming Reddit User Numbers (Applying an ML Model)

Using the given data set, we will count the unique Reddit users in each month and then predict the number of unique users in upcoming months with an ML model. With these predictions, the Reddit development team can prepare for the upcoming load.

First things first: let's calculate the number of Reddit users for each past month.

In [1]:
%%time
spark.catalog.dropGlobalTempView("Comments")
#df = sqlContext.read.json("hdfs://orion11:11001/sampled_reddit_v2/*")
df = sqlContext.read.json("hdfs://orion11:11001/sampled_reddit/*")
df.createGlobalTempView("Comments")

CPU times: user 24.4 ms, sys: 7.9 ms, total: 32.3 ms
Wall time: 4min 9s


We filtered out unnecessary comments and cleaned our data as shown below:

In [3]:
%%time
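# Matches author names containing "bot" in any letter case
botExpr = "[bB][oO][tT]"

# Drop deleted or removed comments, and comments from bot-like or deleted authors
filteredComment = df.filter(
    ~((df.body == "[deleted]")
      | (df.body == "[removed]")
      | df.author.rlike(botExpr)
      | (df.author == "[deleted]"))
)
df.unpersist()
filteredComment.cache()
print(filteredComment.count())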

274394659
CPU times: user 17.6 ms, sys: 6.28 ms, total: 23.9 ms
Wall time: 1min 12s


### Add month and year columns to the dataset

In [4]:
%%time
from pyspark.sql.functions import year, month, dayofmonth, from_unixtime
from pyspark.sql.types import DateType

filteredComment = (filteredComment
      .withColumn("year", year(from_unixtime("created_utc").cast(DateType())))
      .withColumn("month",month(from_unixtime("created_utc").cast(DateType()))))
filteredComment.count()

CPU times: user 57.3 ms, sys: 18.3 ms, total: 75.7 ms
Wall time: 1min 22s


274394659
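
One detail worth checking in the cell above: Spark's `from_unixtime` renders timestamps in the session time zone, so a comment posted near a month boundary can land in the neighboring month unless that time zone is UTC. The sketch below shows the same year and month extraction in pandas with the time zone pinned to UTC. The three `created_utc` values are made-up samples, not rows from the data set.

```python
import pandas as pd

# Hypothetical created_utc values (Unix epoch seconds), chosen around a year boundary
sample = pd.DataFrame({"created_utc": [1133395200, 1136073599, 1136073600]})

# Interpret the epoch seconds explicitly as UTC before extracting the parts
ts = pd.to_datetime(sample["created_utc"], unit="s", utc=True)
sample["year"] = ts.dt.year
sample["month"] = ts.dt.month

print(sample)
```

The second and third values are one second apart but fall in December 2005 and January 2006 respectively, which is the kind of edge case a time zone shift would move.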

### Count the number of active users for each month and year

In [154]:
%%time
from pyspark.sql.functions import countDistinct

userByTime = filteredComment.groupBy('month','year').agg(countDistinct('author').alias('NumOfUser'))

CPU times: user 4.79 ms, sys: 1.11 ms, total: 5.9 ms
Wall time: 139 ms


In [156]:
%%time
import pyspark.sql.functions as func
userByTime = userByTime.sort(func.asc('year'), func.asc('month'))
userByTime.show()

#for row in userByTime.rdd.collect():
#    print(row)

+-----+----+---------+
|month|year|NumOfUser|
+-----+----+---------+
|   12|2005|       68|
|    1|2006|      196|
|    2|2006|      416|
|    3|2006|      532|
|    4|2006|      668|
|    5|2006|      839|
|    6|2006|      858|
|    7|2006|     1097|
|    8|2006|     1414|
|    9|2006|     1449|
|   10|2006|     1606|
|   11|2006|     1844|
|   12|2006|     1947|
|    1|2007|     2263|
|    2|2007|     2521|
|    3|2007|     2986|
|    4|2007|     3038|
|    5|2007|     3621|
|    6|2007|     3690|
|    7|2007|     4051|
+-----+----+---------+
only showing top 20 rows

CPU times: user 13.7 ms, sys: 6.84 ms, total: 20.6 ms
Wall time: 1min 48s
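
To make clear what `countDistinct('author')` measures, here is a small pandas sketch of the same aggregation on toy data. The author names are invented for illustration. A user who comments several times in one month is counted once for that month, and a user active in two months is counted in both.

```python
import pandas as pd

# Toy comments: "alice" comments twice in January and once in February
comments = pd.DataFrame({
    "author": ["alice", "alice", "bob", "alice", "carol"],
    "year":   [2006, 2006, 2006, 2006, 2006],
    "month":  [1, 1, 1, 2, 2],
})

# Count distinct authors per (year, month), sorted chronologically
users_by_month = (
    comments.groupby(["year", "month"])["author"]
    .nunique()
    .reset_index(name="NumOfUser")
    .sort_values(["year", "month"])
)

print(users_by_month)
```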


In [157]:
userByTime.coalesce(1).write.format('csv').save('hdfs://orion11:31001/A6_3')

# Existing User Numbers from 2005 to 2017

In the first part of our analysis, we calculated the number of unique Reddit users per month from 2005 to 2017.

![Reddit user numbers by month](https://i.imgur.com/8gC0qtm.jpg "Reddit user numbers, 2005 to 2017")


# Our Machine Learning Model (Linear Regression Model)

Based on the graph above, we decided to use a Linear Regression ML model. We eliminated the first 55 months to make a better prediction, because user numbers in those months are very low compared with later months, as seen in the graph above.
We then fit a LinearRegression model to our ML.csv data.
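
The trimming step is not shown in the notebook, so the following is only a sketch of how the first months could be dropped with pandas. The helper `drop_first_months` is a hypothetical name, and the column names `Month`, `Year`, and `Number` are taken from the model cell further down. The toy rows reuse the first five counts from the table above.

```python
import pandas as pd

def drop_first_months(monthly, n):
    """Return the rows in chronological order without the first n months."""
    ordered = monthly.sort_values(["Year", "Month"]).reset_index(drop=True)
    return ordered.iloc[n:].reset_index(drop=True)

# Toy data: five consecutive months starting in December 2005
monthly = pd.DataFrame({
    "Month":  [12, 1, 2, 3, 4],
    "Year":   [2005, 2006, 2006, 2006, 2006],
    "Number": [68, 196, 416, 532, 668],
})

# Drop the first 2 months here; the analysis above drops the first 55
trimmed = drop_first_months(monthly, 2)
print(trimmed)
```

Sorting by year and then month before slicing matters, because sorting by month alone would interleave different years.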

Linear regression is a supervised machine learning algorithm that performs a regression task: it models a target value as a function of one or more independent variables. It is mostly used for finding the relationship between variables and for forecasting. Regression models differ in the kind of relationship they assume between the dependent and independent variables and in the number of independent variables they use.
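
With two features, the model has the form Number ≈ b0 + b1·Month + b2·Year, where b0, b1, and b2 are notation introduced here for the intercept and the two coefficients. As a minimal sketch, the example below fits `LinearRegression` to synthetic, noise-free data generated from a known linear rule and checks that the fit recovers that rule. The data and the rule are invented for illustration and are unrelated to the Reddit counts.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic features: a month (1..12) and a year offset (0..6)
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.integers(1, 13, size=50),
    rng.integers(0, 7, size=50),
])

# Known rule used to generate the target: y = 10 + 2*month + 5*year_offset
y = 10 + 2 * X[:, 0] + 5 * X[:, 1]

model = LinearRegression().fit(X, y)

# The fitted intercept and coefficients should match the rule above
print(model.intercept_, model.coef_)
```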


In [11]:
%%time
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the monthly user counts (columns: Month, Year, Number)
data_path = '/home4/saozdamar/ML.csv'
data = pd.read_csv(data_path)

# Choose target and features
y = data.Number
features = ['Month', 'Year']
X = data[features]

# Split the data into training and validation sets, for both features and target.
# The split is random; supplying a numeric value to the random_state argument
# guarantees we get the same split every time we run this script.
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

CPU times: user 7.96 ms, sys: 8.34 ms, total: 16.3 ms
Wall time: 1.76 s


After fitting the linear regression model to our current data, we pass in the month and year values we want predictions for (month=1 is January, month=2 is February, and so on).

In [12]:
%%time
# ML.csv holds month, year, and user number
import numpy as np
from sklearn.linear_model import LinearRegression

# Future (month, year) pairs to predict: Aug 2017, Sep 2017, May 2020
future_months = np.array([[8, 2017], [9, 2017], [5, 2020]])

# Fit a linear regression on the training split
model = LinearRegression()
model.fit(train_X, train_y)

# Append the future months to the validation features and predict for all rows
val_X = np.append(val_X, future_months, axis=0)
print(val_X)

preds = model.predict(val_X)
print(preds)



[[   5 2012]
 [  10 2012]
 [   8 2015]
 [   8 2011]
 [   6 2016]
 [   9 2016]
 [   1 2013]
 [   2 2015]
 [  12 2014]
 [   9 2012]
 [   9 2014]
 [   1 2014]
 [   7 2014]
 [   4 2013]
 [   8 2016]
 [   9 2010]
 [   4 2015]
 [   1 2016]
 [  12 2013]
 [  10 2010]
 [   4 2016]
 [   8 2017]
 [   9 2017]
 [   5 2020]]
[ 420218.81944352  511173.22785372 1178645.96654123  240173.29713911
 1376882.37052768 1431455.0155738   582073.46006596 1069500.67644906
 1016791.32591879  492982.34617168  962218.68087274  816691.62741649
  925836.91750866  636646.10511202 1413264.13389176   23746.01147062
 1105882.43981308 1285927.96211755  782173.15856826   41936.89315265
 1340500.60716361 1647882.30124229 1666073.18292433 2297164.15824777]
CPU times: user 5.78 ms, sys: 5.57 ms, total: 11.4 ms
Wall time: 823 ms
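
Because the future months are appended to the validation features, the validation targets no longer line up with the predictions, so an error metric cannot be computed on the combined array. One way around this, sketched below, is to score the model on the untouched validation split and predict the future months separately. The data here is a synthetic stand-in for ML.csv with an exactly linear trend, so the numbers say nothing about the real fit.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for ML.csv: 72 months, 1000 extra users per month
rows = [(m, y) for y in range(2011, 2017) for m in range(1, 13)]
data = pd.DataFrame(rows, columns=["Month", "Year"])
data["Number"] = 1000 * ((data["Year"] - 2011) * 12 + data["Month"])

X = data[["Month", "Year"]]
y = data["Number"]
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

model = LinearRegression().fit(train_X, train_y)

# Score only on validation rows, which have known targets
val_mae = mean_absolute_error(val_y, model.predict(val_X))

# Predict the future months separately, keeping the same column names
future = pd.DataFrame([[8, 2017], [9, 2017]], columns=["Month", "Year"])
future_preds = model.predict(future)

print("Validation MAE:", val_mae)
print(future_preds)
```

Keeping the future rows in a DataFrame with the training column names also avoids mixing DataFrame and array inputs between `fit` and `predict`.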


# Our Results

Below are our predictions for the number of users 6 and 7 months ahead and for May 2020. The expected Reddit user numbers are:

After 6 months - Aug 2017: 1647882

After 7 months - Sep 2017: 1666073

After 3 years - May 2020: 2297164

VERY IMPORTANT NOTE: We used 1% sampled data, so we need to multiply these predictions by 100. The actual predictions are therefore:

After 6 months - Aug 2017: 164788200

After 7 months - Sep 2017: 166607300

After 3 years - May 2020: 229716400
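
The scaling arithmetic can be reproduced with a few lines of Python. This sketch follows the note above and assumes that user counts scale linearly with the sampling fraction, which is a simplification for distinct counts. The dictionary keys correspond to the model inputs `[8, 2017]`, `[9, 2017]`, and `[5, 2020]`, and the values are the predictions truncated to whole users.

```python
# Inverse of the 1% sampling fraction stated in the note above
SCALE = 100

# Predictions on the sampled data, truncated to whole users
sampled_preds = {
    "Aug 2017": 1647882,
    "Sep 2017": 1666073,
    "May 2020": 2297164,
}

# Scale each prediction up to the full data set
scaled_preds = {label: value * SCALE for label, value in sampled_preds.items()}

for label, value in scaled_preds.items():
    print(label, value)
```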


![alt text](https://i.imgur.com/rHefKM2.jpg "Logo Title Text 1")

# References

1. https://www.kaggle.com/dansbecker/random-forests