# <p style="background-color:#f7e9ec;font-family:newtimeroman;color:#d90b1c;font-size:150%;text-align:center;border-radius:10px 10px;border-style:solid;border-color:#d90b1c;">Recommendation system for H and M Fashion</p>

**For H and M Fashion EDA please check out my notebook** https://www.kaggle.com/nadianizam/h-m-fashion-eda

<h1 style="background-color:#f7e9ec;font-family:newtimeroman;color:#d90b1c;">Terminologies</h1>

A few terms need to be understood before moving forward.

**Apache Spark:** Apache Spark is an open-source, distributed, general-purpose cluster-computing framework. It can also be used with Hadoop.

**Collaborative filtering:** Collaborative filtering is a method of making automatic predictions (filtering) about the interests of a user by collecting preferences or taste information from many users. For example, if person A likes items 1, 2 and 3 and person B likes items 2, 3 and 4, they have similar interests, so A is likely to like item 4 and B is likely to like item 1.
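The A and B example above can be sketched in plain Python. This is only a toy user-based illustration, not the ALS model used later in this notebook; the `likes` dictionary, the Jaccard similarity and the `min_similarity` threshold are assumptions made for the sketch.

```python
# Toy user-based collaborative filtering (illustrative only; all names are hypothetical).
likes = {
    "A": {1, 2, 3},
    "B": {2, 3, 4},
}

def jaccard(a, b):
    """Overlap between two sets of liked items, between 0 and 1."""
    return len(a & b) / len(a | b)

def recommend_from_similar_users(user, likes, min_similarity=0.3):
    """Suggest items liked by sufficiently similar users that `user` has not liked yet."""
    suggestions = set()
    for other, items in likes.items():
        if other != user and jaccard(likes[user], items) >= min_similarity:
            suggestions |= items - likes[user]
    return sorted(suggestions)

print(recommend_from_similar_users("A", likes))  # [4]
print(recommend_from_similar_users("B", likes))  # [1]
```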

**Alternating least squares (ALS) matrix factorization:** The idea is to take a large (potentially huge) user-item matrix and factor it into two lower-dimensional matrices, one of user factors and one of item factors, by alternately fixing one and solving a least-squares problem for the other. Their product approximates the original matrix and fills in its missing entries. ALS comes built into Apache Spark.
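To make the alternating idea concrete, here is a minimal rank-1 sketch in plain Python. It is not Spark's implementation (which uses higher ranks and runs distributed); the tiny `ratings` table, the regularization value `reg` and the iteration count are assumptions chosen for illustration.

```python
# Rank-1 alternating least squares on a tiny ratings table (illustrative sketch).
# Observed (user, item) -> rating; missing pairs are simply absent.
ratings = {(0, 0): 2.0, (0, 1): 4.0, (1, 0): 1.0, (1, 1): 2.0, (2, 1): 6.0}
n_users, n_items, reg = 3, 2, 0.01

u = [1.0] * n_users   # one latent factor per user
v = [1.0] * n_items   # one latent factor per item

def squared_error(u, v):
    """Sum of squared errors over the observed ratings only."""
    return sum((r - u[i] * v[j]) ** 2 for (i, j), r in ratings.items())

for _ in range(20):
    # Fix the item factors and solve the regularized least-squares problem for each user.
    for i in range(n_users):
        obs = [(j, r) for (ui, j), r in ratings.items() if ui == i]
        u[i] = sum(r * v[j] for j, r in obs) / (sum(v[j] ** 2 for j, _ in obs) + reg)
    # Fix the user factors and solve for each item.
    for j in range(n_items):
        obs = [(i, r) for (i, ij), r in ratings.items() if ij == j]
        v[j] = sum(r * u[i] for i, r in obs) / (sum(u[i] ** 2 for i, _ in obs) + reg)

print(squared_error(u, v))   # small: the factors reproduce the observed ratings closely
print(u[2] * v[0])           # estimate for the unobserved pair (user 2, item 0)
```

The second printed value is the point of the exercise: the product of the factors gives a score for a pair that never appeared in the data.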

**PySpark:** PySpark is the Python API for Apache Spark.

<h1 style="background-color:#f7e9ec;font-family:newtimeroman;color:#d90b1c;">1-Initialize Spark session</h1>

In [1]:
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.2.1.tar.gz (281.4 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.2.1-py2.py3-none-any.whl size=281853642 sha256=06513894fbccfb268ceb163451cb68a60b2493639b62b1c43771fbd090493538
  Stored in directory: /root/.cache/pip/wheels/9f/f5/07/7cd8017084dce4e93e84e92efd1e1d5334db05f2e83bcef74f
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.2.1



<h1 style="background-color:#f7e9ec;font-family:newtimeroman;color:#d90b1c;">2-Load libraries</h1>

In [2]:
import numpy as np
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, date_format, from_unixtime, month, unix_timestamp, year
from pyspark.ml.recommendation import ALS

# getOrCreate() already returns a SparkSession, so no extra wrapping is needed
spark = SparkSession.builder.appName("Recommendations").config("spark.sql.files.maxPartitionBytes", 5000000).getOrCreate()



Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/04/29 17:39:24 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable



<h1 style="background-color:#f7e9ec;font-family:newtimeroman;color:#d90b1c;">3-Load Dataset in Apache Spark</h1>

In [3]:
transaction = spark.read.option("header",True) \
              .csv("../input/h-and-m-personalized-fashion-recommendations/transactions_train.csv")
transaction.printSchema()

root
 |-- t_dat: string (nullable = true)
 |-- customer_id: string (nullable = true)
 |-- article_id: string (nullable = true)
 |-- price: string (nullable = true)
 |-- sales_channel_id: string (nullable = true)



In [4]:
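from pyspark.sql import functions as F

# Earliest and latest transaction dates; t_dat is a 'yyyy-MM-dd' string, so it sorts chronologically.
# Using F.min / F.max avoids shadowing Python's built-in min and max.
min_date, max_date = transaction.select(F.min("t_dat"), F.max("t_dat")).first()
min_date, max_date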

                                                                                

('2018-09-20', '2020-09-22')

<h1 style="background-color:#f7e9ec;font-family:newtimeroman;color:#d90b1c;">4-Select data for recommendation</h1>

The transaction dataset has 31,788,324 rows and 5 columns. Let's first capture the most recently bought articles. For the recommendations I am selecting only 2020-09-22, which is the last transaction date.

In [5]:

from pyspark.sql.functions import col

# t_dat is a 'yyyy-MM-dd' string, so the last transaction date can be matched directly
hm = transaction.filter(col('t_dat') == '2020-09-22')

# Prepare the dataset: number of purchases per (customer, article) pair
hm = hm.groupby('customer_id', 'article_id').count()
hm.show(5)



+--------------------+----------+-----+
|         customer_id|article_id|count|
+--------------------+----------+-----+
|00f7bc5c0df4c615b...|0780418013|    1|
|02094817e46f3b692...|0791587001|    1|
|0333e5dda0257e9f4...|0839332002|    2|
|07c7a1172caf8fb97...|0573085043|    1|
|081373184e601470c...|0714790020|    1|
+--------------------+----------+-----+
only showing top 5 rows



                                                                                

In [6]:
print((hm.count(), len(hm.columns)))



(29486, 3)


                                                                                

In [7]:
# Count the observed (customer, article) pairs in the dataset
numerator = hm.count()

# Count the number of distinct customer_id and distinct article_id values
num_users = hm.select("customer_id").distinct().count()
num_articles = hm.select("article_id").distinct().count()

# Set the denominator to the number of possible pairs: customers multiplied by articles
denominator = num_users * num_articles

# Sparsity is the percentage of possible pairs that were never observed
sparsity = (1.0 - (numerator * 1.0) / denominator) * 100
print("Sparsity: ", "%.2f" % sparsity + "%.")



Sparsity:  99.96%.
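
As a formula, the sparsity computed above can be written as follows. The symbols $N_{\text{obs}}$ (observed customer-article pairs), $N_{\text{users}}$ and $N_{\text{articles}}$ are notation introduced here for illustration; they correspond to `numerator`, `num_users` and `num_articles` in the code.

$$\text{sparsity} = \left(1 - \frac{N_{\text{obs}}}{N_{\text{users}} \times N_{\text{articles}}}\right) \times 100\%$$

A value of 99.96% means that only about 0.04% of all possible customer-article pairs were actually purchased on that day.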


                                                                                

In [8]:
userId_count = hm.groupBy("customer_id").count().orderBy('count', ascending=False)
userId_count.show()



+--------------------+-----+
|         customer_id|count|
+--------------------+-----+
|30b6056bacc5f5c9d...|   28|
|5e8fb4d457fdffc61...|   28|
|dc1b173e541f8d3c1...|   27|
|6335d496ef463bc40...|   25|
|1796e87fd2e88932b...|   25|
|f50287d9cf052d4b4...|   24|
|54e8ebd39543b5a4d...|   23|
|fd5ce8716faf00f6a...|   23|
|850ec77661a417d27...|   22|
|ad3663a848dccbdda...|   21|
|32f3a6a7ce63d302c...|   21|
|b606fe5786c00151a...|   21|
|298523b6637340717...|   21|
|b49647f84a99ced53...|   21|
|fc783381f1ea2174c...|   21|
|a08e284bb18add2d7...|   21|
|383e1b07e2c1fe169...|   21|
|3ca77aab50ae4532b...|   20|
|2a721767cd9864ed5...|   20|
|af5166e0f89b0d433...|   19|
+--------------------+-----+
only showing top 20 rows



                                                                                

In [9]:
articleId_count = hm.groupBy("article_id").count().orderBy('count', ascending=False)
articleId_count.show()



+----------+-----+
|article_id|count|
+----------+-----+
|0924243002|   91|
|0918522001|   88|
|0866731001|   78|
|0751471001|   75|
|0448509014|   73|
|0714790020|   72|
|0762846027|   68|
|0928206001|   67|
|0893432002|   66|
|0918292001|   65|
|0915529005|   64|
|0788575004|   63|
|0915529003|   63|
|0863583001|   60|
|0930380001|   59|
|0573085028|   59|
|0919273002|   58|
|0850917001|   57|
|0573085042|   56|
|0874110016|   53|
+----------+-----+
only showing top 20 rows



                                                                                

<h1 style="background-color:#f7e9ec;font-family:newtimeroman;color:#d90b1c;">5-Importing important modules</h1>

In [10]:
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS


<h1 style="background-color:#f7e9ec;font-family:newtimeroman;color:#d90b1c;">6-Converting String to index</h1>

Before building an ALS model, note that ALS only accepts numeric, integer-valued user and item IDs. Hence we need to convert the customer_id and article_id string columns into index form.
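The indexing step can be pictured with a small plain-Python sketch. It mimics what I understand to be StringIndexer's default ordering (most frequent label first, ties broken alphabetically); the helper name `fit_string_index` and the sample list are hypothetical and only for illustration.

```python
from collections import Counter

def fit_string_index(values):
    """Map each distinct label to an index: most frequent first, ties alphabetical."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

article_ids = ["0714790020", "0918522001", "0714790020", "0573085043"]
article_index = fit_string_index(article_ids)
print(article_index)
# {'0714790020': 0.0, '0573085043': 1.0, '0918522001': 2.0}
```

The indices are floats, as in Spark's output, which is why the index columns below show values such as `783.0`.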

In [11]:
from pyspark.ml.feature import StringIndexer
from pyspark.ml import Pipeline

# Index the two ID columns; an explicit list keeps the stage order deterministic
indexer = [StringIndexer(inputCol=column, outputCol=column+"_index") for column in ['customer_id', 'article_id']]
pipeline = Pipeline(stages=indexer)
transformed = pipeline.fit(hm).transform(hm)
transformed.show()

22/04/29 18:03:06 WARN DAGScheduler: Broadcasting large task binary with size 1207.9 KiB


+--------------------+----------+-----+-----------------+----------------+
|         customer_id|article_id|count|customer_id_index|article_id_index|
+--------------------+----------+-----+-----------------+----------------+
|00f7bc5c0df4c615b...|0780418013|    1|            783.0|          2237.0|
|02094817e46f3b692...|0791587001|    1|            785.0|            35.0|
|0333e5dda0257e9f4...|0839332002|    2|           4098.0|           732.0|
|07c7a1172caf8fb97...|0573085043|    1|           1702.0|            44.0|
|081373184e601470c...|0714790020|    1|           4146.0|             5.0|
|09bec2a61046ccbea...|0860336002|    1|           6792.0|          2368.0|
|0be4f1ecce204ee32...|0573085028|    1|            799.0|            14.0|
|0c4b30343292b5101...|0918522001|    1|           6825.0|             1.0|
|0e10e02358875468b...|0579541001|    1|           2689.0|            53.0|
|0fc371e67e61a31d7...|0907170001|    1|           1737.0|          1978.0|
|10817b19177f6a53e...|071

                                                                                


<h1 style="background-color:#f7e9ec;font-family:newtimeroman;color:#d90b1c;">7-Creating training and test data</h1>

In [12]:
(training,test)=transformed.randomSplit([0.8, 0.2])


<h1 style="background-color:#f7e9ec;font-family:newtimeroman;color:#d90b1c;">8-Creating ALS model and fitting data</h1>

To build the model, specify the user, item and rating columns explicitly. Set `nonnegative` to `True`, since the purchase counts are always greater than 0. ALS can also treat the ratings as implicit feedback through `implicitPrefs`; here the counts are treated as explicit ratings, so it is left at its default of `False`.

With simple random splits, as in Spark's CrossValidator or TrainValidationSplit, it is very common to encounter users and/or items in the evaluation set that are not in the training set. By default, Spark assigns NaN predictions during ALSModel.transform when a user and/or item factor is not present in the model. We set the cold start strategy to 'drop' to ensure we don't get NaN evaluation metrics.
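The effect of dropping cold-start rows can be seen in a small plain-Python sketch. The `scored` pairs are made-up values for illustration, and the filtering step only imitates what the 'drop' strategy achieves; it is not Spark code.

```python
import math

# (label, prediction) pairs; NaN marks a user or item that was unseen during training.
scored = [(1.0, 1.2), (2.0, 1.5), (1.0, float("nan")), (1.0, 0.9)]

def rmse(pairs):
    """Root mean squared error over (label, prediction) pairs."""
    return math.sqrt(sum((y - p) ** 2 for y, p in pairs) / len(pairs))

print(rmse(scored))            # nan: a single NaN prediction poisons the metric

# Dropping the NaN rows first gives a usable number.
kept = [(y, p) for y, p in scored if not math.isnan(p)]
print(round(rmse(kept), 4))    # 0.3162
```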

In [13]:
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder


#create ALS model
als=ALS(userCol="customer_id_index",itemCol="article_id_index",ratingCol="count",coldStartStrategy="drop",nonnegative=True)

#tune model using ParamGridBuilder
param_grid = ParamGridBuilder()\
            .addGrid(als.rank, [15,20,25])\
            .addGrid(als.maxIter,[5,10,15])\
            .addGrid(als.regParam,[0.09,0.14,0.19])\
            .build()
#define evaluator as RMSE
evaluator = RegressionEvaluator(metricName = "rmse",labelCol = 'count', predictionCol = 'prediction')

#Build cross validation using CrossValidator
cv = CrossValidator(estimator=als,estimatorParamMaps=param_grid, evaluator=evaluator,numFolds=3)


#Fit ALS model to training data
model = cv.fit(training)

22/04/29 18:05:45 WARN DAGScheduler: Broadcasting large task binary with size 1233.8 KiB
22/04/29 18:05:45 WARN DAGScheduler: Broadcasting large task binary with size 1235.3 KiB
22/04/29 18:05:53 WARN DAGScheduler: Broadcasting large task binary with size 1236.8 KiB
22/04/29 18:06:00 WARN DAGScheduler: Broadcasting large task binary with size 1238.1 KiB
22/04/29 18:06:00 WARN DAGScheduler: Broadcasting large task binary with size 1237.0 KiB
22/04/29 18:06:04 WARN DAGScheduler: Broadcasting large task binary with size 1238.3 KiB
22/04/29 18:06:05 WARN DAGScheduler: Broadcasting large task binary with size 1239.1 KiB
22/04/29 18:06:05 WARN DAGScheduler: Broadcasting large task binary with size 1242.2 KiB
22/04/29 18:06:06 WARN DAGScheduler: Broadcasting large task binary with size 1243.6 KiB
22/04/29 18:06:06 WARN DAGScheduler: Broadcasting large task binary with size 1245.0 KiB
22/04/29 18:06:06 WARN DAGScheduler: Broadcasting large task binary with size 1246.3 KiB
22/04/29 18:06:07 WAR
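
The long run of warnings above reflects how much work the grid search does. As a rough count, using only the values from the parameter grid and `numFolds` in the cell above:

```python
from itertools import product

ranks, max_iters, reg_params = [15, 20, 25], [5, 10, 15], [0.09, 0.14, 0.19]
num_folds = 3

# Every combination of rank, maxIter and regParam is one candidate model.
grid = list(product(ranks, max_iters, reg_params))
print(len(grid))               # 27 parameter combinations
print(len(grid) * num_folds)   # 81 ALS fits during cross-validation
```

CrossValidator then refits the best combination once more on the full training set, so a smaller grid is worth considering when run time matters.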

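In [None]:
# Alternative (not run): skip the grid search and fit a single ALS model directly.
# Note that the result is then an ALSModel, so `model.bestModel` below would not apply.
# als = ALS(maxIter=5, regParam=0.09, rank=25, userCol="customer_id_index", itemCol="article_id_index", ratingCol="count", coldStartStrategy="drop", nonnegative=True)
# model = als.fit(training)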


<h1 style="background-color:#f7e9ec;font-family:newtimeroman;color:#d90b1c;">9-Evaluate rmse</h1>

In [14]:
#Extract best model from the tuning exercise using ParamGridBuilder
best_model = model.bestModel

#Generate predictions and evaluate using RMSE
predictions = best_model.transform(test)
rmse = evaluator.evaluate(predictions)

22/04/29 19:06:21 WARN DAGScheduler: Broadcasting large task binary with size 1279.7 KiB
22/04/29 19:06:21 WARN DAGScheduler: Broadcasting large task binary with size 1278.3 KiB
22/04/29 19:08:17 WARN DAGScheduler: Broadcasting large task binary with size 1217.7 KiB
22/04/29 19:08:18 WARN DAGScheduler: Broadcasting large task binary with size 1324.4 KiB


In [15]:
#print evaluation metrics and model parameters
print("RMSE =" + str(rmse))
print("**Best Model**")
print("Rank : {}".format(best_model.rank))
print("MaxIter: {}".format(best_model._java_obj.parent().getMaxIter()))
print("RegParam: {}".format(best_model._java_obj.parent().getRegParam()))

RMSE =0.4176572461121671
**Best Model**
Rank : 15
MaxIter: 15
RegParam: 0.09


<h1 style="background-color:#f7e9ec;font-family:newtimeroman;color:#d90b1c;">10-Providing Recommendations by Article id</h1>

In [16]:
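# For each article, the 10 customers with the highest predicted score
# (show() returns None, so keep the DataFrame in its own variable)
item_recs = best_model.recommendForAllItems(10)
item_recs.show(10)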

22/04/29 19:08:42 WARN DAGScheduler: Broadcasting large task binary with size 1324.5 KiB
22/04/29 19:08:53 WARN DAGScheduler: Broadcasting large task binary with size 1301.9 KiB


+----------------+--------------------+
|article_id_index|     recommendations|
+----------------+--------------------+
|               1|[{9001, 5.307284}...|
|               3|[{9001, 6.7465367...|
|               5|[{9001, 5.697557}...|
|               6|[{4907, 5.1659846...|
|               9|[{9001, 5.090039}...|
|              12|[{4907, 5.4253325...|
|              13|[{9001, 5.3778076...|
|              15|[{9001, 6.311277}...|
|              16|[{9001, 5.5040755...|
|              17|[{4907, 6.5477405...|
+----------------+--------------------+
only showing top 10 rows



                                                                                


<h1 style="background-color:#f7e9ec;font-family:newtimeroman;color:#d90b1c;">11-Providing Recommendations by Customer id</h1>

In [25]:
df_recom = best_model.recommendForAllUsers(10)
df_recom.show(10)

22/04/29 19:16:38 WARN DAGScheduler: Broadcasting large task binary with size 1324.5 KiB

+-----------------+--------------------+
|customer_id_index|     recommendations|
+-----------------+--------------------+
|                1|[{1661, 2.1518571...|
|                3|[{5040, 3.1944647...|
|                5|[{1661, 3.2476745...|
|                6|[{1910, 5.81864},...|
|                9|[{1661, 2.1185696...|
|               12|[{1661, 2.3631198...|
|               13|[{1661, 2.4434402...|
|               15|[{6383, 2.3268082...|
|               16|[{6383, 2.1952343...|
|               17|[{1661, 2.2257655...|
+-----------------+--------------------+
only showing top 10 rows



22/04/29 19:16:50 WARN DAGScheduler: Broadcasting large task binary with size 1301.9 KiB
                                                                                

In [26]:
df_recom = df_recom.select("customer_id_index","recommendations.article_id_index")
df_recom.show(10)
df_recom = df_recom.toPandas()

22/04/29 19:17:04 WARN DAGScheduler: Broadcasting large task binary with size 1324.5 KiB
22/04/29 19:17:13 WARN DAGScheduler: Broadcasting large task binary with size 1302.2 KiB
                                                                                

+-----------------+--------------------+
|customer_id_index|    article_id_index|
+-----------------+--------------------+
|                1|[1661, 5040, 6383...|
|                3|[5040, 6383, 1661...|
|                5|[1661, 5111, 4910...|
|                6|[1910, 1661, 42, ...|
|                9|[1661, 6383, 5040...|
|               12|[1661, 4405, 4146...|
|               13|[1661, 6383, 5040...|
|               15|[6383, 5040, 1661...|
|               16|[6383, 5040, 1661...|
|               17|[1661, 6383, 5040...|
+-----------------+--------------------+
only showing top 10 rows



22/04/29 19:17:14 WARN DAGScheduler: Broadcasting large task binary with size 1324.5 KiB
22/04/29 19:17:24 WARN DAGScheduler: Broadcasting large task binary with size 1302.3 KiB
                                                                                

In [27]:
df_recom.sort_values('customer_id_index')

Unnamed: 0,customer_id_index,article_id_index
4803,0,"[1661, 5040, 6383, 4405, 5111, 4249, 4146, 303..."
0,1,"[1661, 5040, 6383, 4405, 5111, 3031, 4249, 489..."
4804,2,"[1661, 4212, 6383, 5040, 4405, 3018, 5111, 589..."
1,3,"[5040, 6383, 1661, 4146, 4249, 4405, 4894, 511..."
4805,4,"[5040, 6383, 1661, 3031, 5111, 4405, 4249, 316..."
...,...,...
9656,10522,"[6383, 5040, 1661, 5111, 4249, 4146, 4405, 489..."
9657,10523,"[1661, 4170, 6874, 1910, 3031, 7092, 368, 4405..."
4801,10524,"[1661, 6383, 5040, 4146, 4405, 5111, 4249, 303..."
9658,10525,"[2716, 4910, 1661, 5111, 3869, 3031, 7231, 122..."


<h1 style="background-color:#f7e9ec;font-family:newtimeroman;color:#d90b1c;">12-Converting back to string form</h1>

As seen in the output above, the results are in integer index form, so we need to convert them back to the original IDs. The code is a little longer because of the several conversions involved.
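The reverse mapping is just a dictionary lookup from index to original ID. A minimal sketch, using two index and ID pairs that appear in the indexer output above; the helper name `to_article_ids` and the unknown index 999 are hypothetical.

```python
# Index -> original article ID (StringIndexer emits float indices such as 1.0).
index_to_article = {1.0: "0918522001", 5.0: "0714790020"}

def to_article_ids(indices, lookup):
    """Translate recommended indices to article IDs, skipping unknown indices."""
    return [lookup[i] for i in indices if i in lookup]

# Integer indices from the recommendations match the float keys, since 1 == 1.0 in Python.
print(to_article_ids([1, 5, 999], index_to_article))
# ['0918522001', '0714790020']
```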

In [28]:
# Collect the original IDs alongside their indices so they can be mapped back in pandas
md = transformed.select('article_id', 'article_id_index', 'customer_id', 'customer_id_index')
md = md.toPandas()
md

22/04/29 19:19:51 WARN DAGScheduler: Broadcasting large task binary with size 1205.2 KiB
                                                                                

Unnamed: 0,article_id,article_id_index,customer_id,customer_id_index
0,0780418013,2237.0,00f7bc5c0df4c615b2502a2c2e9ef9eff988c81dec2e5e...,783.0
1,0791587001,35.0,02094817e46f3b692149b06cf9577e42848c2294e78598...,785.0
2,0839332002,732.0,0333e5dda0257e9f498be52f1e569bfae576caed0cbdcd...,4098.0
3,0573085043,44.0,07c7a1172caf8fb9784b28e51b25b985ab6a1ec7ce923e...,1702.0
4,0714790020,5.0,081373184e601470cc9911f33d3eeebc6f33ed79222573...,4146.0
...,...,...,...,...
29481,0817150004,851.0,f8156f726aeaf44e90c1988837e13c9c9974ee19009e5b...,2594.0
29482,0893432005,96.0,f825202d015506981dc42c53afb7b56a36e05c85b67886...,1657.0
29483,0799754001,5499.0,faed38aaccd80db66f5a1581fe99af84e79fe398c91899...,10465.0
29484,0928206001,7.0,fcffcb9777aab7a53e3b382a840958d800e6d53bdd8a20...,4052.0


In [29]:
# Lookup tables from index back to the original ID
dict1 = dict(zip(md['article_id_index'], md['article_id']))
dict2 = dict(zip(md['customer_id_index'], md['customer_id']))
# Translate each list of recommended article indices, then each customer index
df_recom['article_id'] = df_recom['article_id_index'].map(lambda x: [dict1[y] for y in x if y in dict1])
df_recom['customer_id'] = df_recom['customer_id_index'].map(dict2)
df_recom

Unnamed: 0,customer_id_index,article_id_index,article_id,customer_id
0,1,"[1661, 5040, 6383, 4405, 5111, 3031, 4249, 489...","[0297078008, 0750481010, 0857347002, 057104800...",5e8fb4d457fdffc61e235328ba7e43a4139c94c5f9d52a...
1,3,"[5040, 6383, 1661, 4146, 4249, 4405, 4894, 511...","[0750481010, 0857347002, 0297078008, 031644100...",1796e87fd2e88932b50966a07cc18b490cd5e1474dbee3...
2,5,"[1661, 5111, 4910, 3869, 4249, 3031, 7073, 316...","[0297078008, 0757971006, 0724905016, 090496100...",f50287d9cf052d4b423fc3d4d7a0c306de8c752583544b...
3,6,"[1910, 1661, 42, 4405, 5891, 423, 3031, 368, 7...","[0880479001, 0297078008, 0934536001, 057104800...",54e8ebd39543b5a4d69c3e7d79977558d2a606e6540ba0...
4,9,"[1661, 6383, 5040, 4405, 4249, 3031, 5111, 707...","[0297078008, 0857347002, 0750481010, 057104800...",298523b6637340717e19df4e2a46a7ce7d80434c985a84...
...,...,...,...,...
9654,10519,"[1661, 5040, 6383, 4146, 4405, 5111, 4249, 303...","[0297078008, 0750481010, 0857347002, 031644100...",ff240ee1590922141103063f7b4212c3832f0f5b0e0eb2...
9655,10521,"[1661, 6383, 5040, 5891, 4894, 5111, 856, 6533...","[0297078008, 0857347002, 0750481010, 082510900...",ff6d8d22b25287dfb2b0bbec08d4425aa67fbf02911fe0...
9656,10522,"[6383, 5040, 1661, 5111, 4249, 4146, 4405, 489...","[0857347002, 0750481010, 0297078008, 075797100...",ff6f55a51af284b71dcd264396b299e548f968c1769e71...
9657,10523,"[1661, 4170, 6874, 1910, 3031, 7092, 368, 4405...","[0297078008, 0351484041, 0877261003, 088047900...",ff9e122067c18aac7bd96897bb9550405bb11abcc7e2e0...


In [34]:
# Keep only the original IDs for the final output
finalpre = df_recom[['customer_id', 'article_id']]
finalpre

Unnamed: 0,customer_id,article_id
0,5e8fb4d457fdffc61e235328ba7e43a4139c94c5f9d52a...,"[0297078008, 0750481010, 0857347002, 057104800..."
1,1796e87fd2e88932b50966a07cc18b490cd5e1474dbee3...,"[0750481010, 0857347002, 0297078008, 031644100..."
2,f50287d9cf052d4b423fc3d4d7a0c306de8c752583544b...,"[0297078008, 0757971006, 0724905016, 090496100..."
3,54e8ebd39543b5a4d69c3e7d79977558d2a606e6540ba0...,"[0880479001, 0297078008, 0934536001, 057104800..."
4,298523b6637340717e19df4e2a46a7ce7d80434c985a84...,"[0297078008, 0857347002, 0750481010, 057104800..."
...,...,...
9654,ff240ee1590922141103063f7b4212c3832f0f5b0e0eb2...,"[0297078008, 0750481010, 0857347002, 031644100..."
9655,ff6d8d22b25287dfb2b0bbec08d4425aa67fbf02911fe0...,"[0297078008, 0857347002, 0750481010, 082510900..."
9656,ff6f55a51af284b71dcd264396b299e548f968c1769e71...,"[0857347002, 0750481010, 0297078008, 075797100..."
9657,ff9e122067c18aac7bd96897bb9550405bb11abcc7e2e0...,"[0297078008, 0351484041, 0877261003, 088047900..."


<h1 style="background-color:#f7e9ec;font-family:newtimeroman;color:#d90b1c;">13-Export the prediction</h1>

In [35]:
# finalpre is already a pandas DataFrame, so it can be written to CSV directly
finalpre.to_csv('my_pred.csv', index=False)

# <p style="background-color:#f7e9ec;font-family:newtimeroman;color:#d90b1c;font-size:150%;text-align:center;border-radius:10px 10px;border-style:solid;border-color:#d90b1c;">Please leave your comments and suggestions, and if you like this kernel, an UPVOTE is greatly appreciated</p>