

### Analyse search terms on the e-commerce web server


##### In this assignment, you will download the search term data set for the e-commerce web server and run analytic queries on it.


In [1]:
# Install PySpark and findspark
!pip install pyspark
!pip install findspark



In [2]:
import findspark
findspark.init()

In [3]:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

In [4]:
# Start session
spark = SparkSession.builder.appName("Saving and Loading a SparkML Model").getOrCreate()

23/04/15 17:40:45 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/04/15 17:40:47 WARN util.Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


In [8]:
# Download the search term dataset from the URL below
!wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DB0321EN-SkillsNetwork/Bigdata%20and%20Spark/searchterms.csv

--2023-04-15 17:36:52--  https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DB0321EN-SkillsNetwork/Bigdata%20and%20Spark/searchterms.csv
Resolving cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)... 169.63.118.104
Connecting to cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)|169.63.118.104|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 233457 (228K) [text/csv]
Saving to: ‘searchterms.csv’


2023-04-15 17:36:52 (43.9 MB/s) - ‘searchterms.csv’ saved [233457/233457]



In [5]:
# Load the CSV into a Spark DataFrame (without inferSchema, every column is read as a string)
df = spark.read.format("csv").option("header", "true").load("searchterms.csv")

In [6]:
df.show()

+---+-----+----+--------------+
|day|month|year|    searchterm|
+---+-----+----+--------------+
| 12|   11|2021| mobile 6 inch|
| 12|   11|2021| mobile latest|
| 12|   11|2021|   tablet wifi|
| 12|   11|2021|laptop 14 inch|
| 12|   11|2021|     mobile 5g|
| 12|   11|2021| mobile 6 inch|
| 12|   11|2021|        laptop|
| 12|   11|2021|        laptop|
| 12|   11|2021|     mobile 5g|
| 12|   11|2021|   tablet wifi|
| 12|   11|2021|     mobile 5g|
| 12|   11|2021| gaming laptop|
| 12|   11|2021|     mobile 5g|
| 12|   11|2021|     mobile 5g|
| 12|   11|2021| mobile 6 inch|
| 12|   11|2021| mobile latest|
| 12|   11|2021| mobile 6 inch|
| 12|   11|2021|   tablet wifi|
| 12|   11|2021|     mobile 5g|
| 12|   11|2021|        laptop|
+---+-----+----+--------------+
only showing top 20 rows



In [8]:
# Print the number of rows and columns
# (Take a screenshot of the code and name it shape.jpg)
print("Number of rows: {}".format(df.count()))
print("Number of columns: {}".format(len(df.columns)))

Number of rows: 10000
Number of columns: 4


In [9]:
# Print the top 5 rows
# (Take a screenshot of the code and name it top5rows.jpg)
df.show(5)

+---+-----+----+--------------+
|day|month|year|    searchterm|
+---+-----+----+--------------+
| 12|   11|2021| mobile 6 inch|
| 12|   11|2021| mobile latest|
| 12|   11|2021|   tablet wifi|
| 12|   11|2021|laptop 14 inch|
| 12|   11|2021|     mobile 5g|
+---+-----+----+--------------+
only showing top 5 rows



In [10]:
# Find out the data type of the column searchterm
# (Take a screenshot of the code and name it datatype.jpg)
df.printSchema()

root
 |-- day: string (nullable = true)
 |-- month: string (nullable = true)
 |-- year: string (nullable = true)
 |-- searchterm: string (nullable = true)



In [12]:
# How many times was the term `gaming laptop` searched?
# (Take a screenshot of the code and name it gaminglaptop.jpg)
gaming_laptop_count = df.filter(df.searchterm == "gaming laptop").count()
print(gaming_laptop_count)

499


In [13]:
# Print the top 5 most frequently used search terms
# (Take a screenshot of the code and name it top5terms.jpg)
searchterm_counts = df.groupBy("searchterm").count()
top_searchterms = searchterm_counts.orderBy("count", ascending=False).limit(5)
top_searchterms.show()



+-------------+-----+
|   searchterm|count|
+-------------+-----+
|mobile 6 inch| 2312|
|    mobile 5g| 2301|
|mobile latest| 1327|
|       laptop|  935|
|  tablet wifi|  896|
+-------------+-----+
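
To make the logic of the two queries above concrete, here is a small plain-Python sketch of the same computation. The list `sample_terms` is a hypothetical stand-in for the `searchterm` column, not data from the real file.

```python
from collections import Counter

# Hypothetical sample of search terms (the real data set has 10,000 rows).
sample_terms = [
    "mobile 5g", "laptop", "mobile 5g", "tablet wifi",
    "laptop", "mobile 5g", "gaming laptop",
]

# Plays the role of df.groupBy("searchterm").count()
term_counts = Counter(sample_terms)

# Plays the role of orderBy("count", ascending=False).limit(2)
top_terms = term_counts.most_common(2)

# Plays the role of df.filter(df.searchterm == "gaming laptop").count()
gaming_laptop_count = term_counts["gaming laptop"]

print(top_terms)
print(gaming_laptop_count)
```

Spark runs the same group, count, and sort steps, but distributes them across partitions, which is why the DataFrame version scales to data that does not fit in one Python list.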



                                                                                

In [17]:
# The pretrained sales forecasting model is available at the URL below.
# Download the archive first, then extract it.
!wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DB0321EN-SkillsNetwork/Bigdata%20and%20Spark/model.tar.gz
!tar xzf model.tar.gz
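
The archive has to be present locally before it can be extracted. As an alternative to shell commands, the following sketch does both steps from Python using only the standard library. The helper name `fetch_and_extract` and its parameters are hypothetical; the URL is the one given in the cell above.

```python
import os
import tarfile
import urllib.request

MODEL_URL = (
    "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/"
    "IBM-DB0321EN-SkillsNetwork/Bigdata%20and%20Spark/model.tar.gz"
)


def fetch_and_extract(url, archive_path="model.tar.gz", dest_dir="."):
    """Download the archive if it is missing, unpack it, and return the member names."""
    if not os.path.exists(archive_path):
        # Only hit the network when the archive is not already on disk.
        urllib.request.urlretrieve(url, archive_path)
    with tarfile.open(archive_path, "r:gz") as archive:
        names = archive.getnames()
        archive.extractall(dest_dir)
    return names


# In the notebook this could be called as:
# fetch_and_extract(MODEL_URL)
```

Returning the member names makes it easy to check which directory the model was unpacked into before passing that path to `LinearRegressionModel.load`.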

In [22]:
# Load the sales forecast model.
# (Take a screenshot of the code and name it loadmodel.jpg)
from pyspark.ml.regression import LinearRegressionModel

model = LinearRegressionModel.load('sales_prediction.model')

In [24]:
df.show(5)

+---+-----+----+--------------+
|day|month|year|    searchterm|
+---+-----+----+--------------+
| 12|   11|2021| mobile 6 inch|
| 12|   11|2021| mobile latest|
| 12|   11|2021|   tablet wifi|
| 12|   11|2021|laptop 14 inch|
| 12|   11|2021|     mobile 5g|
+---+-----+----+--------------+
only showing top 5 rows



In [21]:
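# Using the sales forecast model, predict the sales for the year 2023.
# (Take a screenshot of the code and name it forecast.jpg)
from pyspark.ml.feature import VectorAssembler

def predict(year):
    # The assembled vector column must have the name the loaded model expects,
    # so read it from the model instead of hard-coding it.
    features_col = model.getOrDefault("featuresCol")
    assembler = VectorAssembler(inputCols=["year"], outputCol=features_col)
    # Single-row DataFrame; the second column is an unused placeholder.
    data = [[year, 0]]
    columns = ["year", "height"]
    input_df = spark.createDataFrame(data, columns)
    features_df = assembler.transform(input_df).select(features_col, "height")
    predictions = model.transform(features_df)
    predictions.select("prediction").show()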


In [None]:
predict(2023)
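
For a one-feature linear regression model, the prediction is the feature value multiplied by a coefficient plus an intercept. The sketch below shows that arithmetic in plain Python. The coefficient and intercept are hypothetical illustration values, not the parameters of the pretrained model; on the loaded model, the real values should be available as `model.coefficients` and `model.intercept`.

```python
def linear_prediction(features, coefficients, intercept):
    """Return the dot product of coefficients and features, plus the intercept."""
    return sum(w * x for w, x in zip(coefficients, features)) + intercept


# Hypothetical parameters, chosen only to make the arithmetic easy to follow.
example_coefficients = [2.0]
example_intercept = -4000.0

# 2.0 * 2023 + (-4000.0)
forecast = linear_prediction([2023], example_coefficients, example_intercept)
print(forecast)
```

Comparing a hand-computed value like this with the output of `predict(2023)` is a quick sanity check that the feature vector was assembled from the intended column.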