This notebook is the official version of the EDA code. It includes the following steps:

<li>Loading of the libraries and Amazon Dataframe
<li>Dataset Statistics
<li>Pre-Processing 
<li>Sentiment Analysis
<li>EDA 
<li>Export to files (for visualization in Tableau)
</li>

Sampling: Cmd 13 sets the fraction of the full dataset that is sampled for the EDA. Adjust it to the sample size you need; the default is 0.40.

In [2]:
# Import Spark NLP and print its version
import sparknlp
print("Spark NLP version:", sparknlp.version())

import nltk
nltk.download('stopwords')

#Import relevant packages for pipeline
from sparknlp.annotator import *
from sparknlp.common import *
from sparknlp.base import *
from pyspark.ml import Pipeline
from pyspark.ml.feature import StopWordsRemover
from nltk.corpus import stopwords


import pyspark.sql.functions as f
from pyspark.sql.functions import length, when

In [3]:
# Load the three review tables, adding a 'type' label that marks each row
# as a games, books, or home review (Nic+Jessee)
df1 = spark.sql("select 'games' as type, * from default.video_games_5")
df2 = spark.sql("select 'books' as type, * from default.books_5_small")
df3 = spark.sql("select 'home' as type, * from default.home_and_kitchen_5_small")

# Combine the three tables and print (row count, column count)

df_complete = df1.union(df2).union(df3)
print((df_complete.count(), len(df_complete.columns)))

### Dataset Statistics

In [5]:
#Reviews per category
df_complete.groupBy('type').count().show()

In [6]:
#Count of reviews per rating
display(df_complete.groupBy('overall').count())

overall,count
1.0,179046
4.0,612575
3.0,291621
2.0,153010
5.0,2251079
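
The counts above can be turned into shares to make the skew toward 5-star reviews explicit. The following is a minimal plain-Python sketch that uses the counts shown in the output; the variable names are illustrative and not part of the notebook.

```python
# Rating counts copied from the cell output above
rating_counts = {1.0: 179046, 2.0: 153010, 3.0: 291621, 4.0: 612575, 5.0: 2251079}

total_reviews = sum(rating_counts.values())

# Share of each rating as a fraction of all reviews
rating_shares = {rating: count / total_reviews for rating, count in rating_counts.items()}

for rating in sorted(rating_shares):
    print(f"{rating}: {rating_shares[rating]:.1%}")
```

With these counts, 5-star reviews account for about 65% of all reviews, which is worth keeping in mind when reading the sentiment results.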


In [7]:
#Distinct ReviewerID
from pyspark.sql.functions import countDistinct
df_complete.agg(countDistinct("reviewerID")).show()


In [8]:
#Number of unique Items
from pyspark.sql.functions import countDistinct
df_complete.agg(countDistinct("asin")).show()

In [9]:
#Most reviewed items (review count per asin)
from pyspark.sql.functions import desc
df_complete.groupBy("asin").count().sort(desc("count")).show()

In [10]:
#Most common reviewer names
from pyspark.sql.functions import desc
df_complete.groupBy("ReviewerName").count().sort(desc("count")).show()

In [11]:
#Reviewers ranked by number of reviews (highest first)
from pyspark.sql.functions import desc
display(df_complete.groupBy("reviewerID").count().sort(desc("count")))

In [12]:
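#Display all reviews by the reviewer with the highest number of reviews.
#reviewTime is still a string here, so the descending sort is lexicographic, not chronological.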
display(df_complete.filter("ReviewerID = 'A3V6Z4RCDGRC44'").sort(desc("reviewTime")))

type,reviewID,overall,vote,verified,reviewTime,reviewerID,asin,reviewerName,reviewText,summary,unixReviewTime,label
games,51169,5.0,10,False,"12 7, 2003",A3V6Z4RCDGRC44,B0000CC7H5,Lisa Shea,"In this puzzle-adventure set in an Egypt-like location, you switch between two characters. One is Sphinx, a tail-toting teenage guy, and the other is a mummy. The graphics in the game are great. Your tail swishes from side to side, the wooden bridges sag beneath your weight, your shadows move along with you. It has a cartoony feel to it, which is fine for its style. The smoke and fire aren't very well done, but those are always tricky. While sound really shouldn't be a key for a game, it is sadly lacking in this one. The background music and ambient noises are OK, usually very faint. But this game has a ton of dialogue and there is no speaking! That might have been fine back in the days of Final Fantasy 8, but in this day and age it really feels lacking every time an important dialogue is going on and there is just ... silence. The actual gameplay is very enjoyable. The characters' movements are smooth and natural, from swinging up a rope to leaping across chasms. Each map has enough area for you to explore and not feel stuck, but it's not so huge that you get lost. There are good assortments of puzzles and action sequences to keep both your fingers and your brain engaged. Sure, a few of the puzzles are a bit hokey - but that's the case in any game. The save feature isn't great, but you learn to deal with that. In general it's a fun game for all ages!",Great graphics and interesting gameplay,1070755200,1
games,36983,5.0,2,True,"12 7, 2003",A3V6Z4RCDGRC44,B00006F2ZQ,Lisa Shea,"The Metal Gear Solid series is famous for its intense gameplay. Metal Gear Solid 2 adds to the series with combat, stealth, strategy and intrigue. You are Raiden / Snake, part of a covert military group. The graphics in this game are just amazing. The movements, surroundings, lighting are very well done. You can peer around corners, crouch down, crawl along to sneak up up enemies. The realism isn't quite as nice as some other games out there - it's not ""gritty"" - but it's close. The sounds fit in well with the game. The background music is moody and suspense-filled, and the ambient noises of footfalls and doors opening and closing all reallyd raw you into the game. The gameplay is great, in that it's not a blast-em-up button masher. You have to think about your surroundings and work to take out your enemies in the most efficient way you can. You can replay levels to find new and better ways to get through the level. Where the game drags is in the PLOT. First off, the plot is more like a soap opera, with all sorts of really bizarre twists and interpersonal conflicts. Second, this is supposed to be a stealth-trained guy, a la splinter cell. Instead, he gets these INCREDIBLY long discussions while he's in the middle of a mission! They should have had briefing sessions BEFORE each mission that you could skip, and then when you were actually in the mission you should just do it. There could maybe be an ""on line encyclopedia"" you could go to if you got stuck - but to keep deliberately interrupting the game for these dialogues was poor. Still, you can always have a book nearby to read while those dialogues go on - I hate to skip them just in case something important IS said in them. Usually nothing is, but you never know. If you can last through those cut-scenes, the rest of the game is really enjoyable and well worth playing. 
It's a lesson in patience!",Great gameplay but LONG cut scenes,1070755200,1
games,35647,5.0,8,False,"12 7, 2003",A3V6Z4RCDGRC44,B000069BCN,Lisa Shea,"Hitman 2 is a first person shooter which rewards control and stealth over full-out blasting. Gorgeous graphics and freeform gameplay make this great for teenagers and adults. To start out, you're a retired hitman, enjoying the beauty, peace, and serenity of Sicily, Italy. You talk with your friend, the priest, about your hope for being one with God. Unfortunately for you, this is when the bad guys show up and kidnap the priest. You don't have the $500,000 ransom they demand, so you call up your agency and arrange to do some missions for them in return for help rescuing the priest. And you're off! You're wandering through very well done maps, with a diverse collection of tools at your disposal. Do you pick the locks to go in the back way? Shoot the guy from the trees, sneak up and wire him, maybe chloroform him in a corner? The game is very open-pathed. You choose which way to approach a mission, what might work best. Your opponents are smart and react to finding bodies or to unusual behavior. This isn't a go-here-then-go-there game, it's a game where you're placed in the world and have to decide which approach works best for you. There's a stealth meter, you can hear your footsteps, you can see your shadow. Having just finished playing Splinter Cell, though, the game suffers a bit in comparison. The movements are more jerky, the lighting not nearly as nicely done. You can peer through keyholes, but not around a partially opened door, or shoot around the door. You can't move along walls and peer around the corner. So some stealth techniques we found quite natural and helpful from Spliter Cell were impossible here. Also, the open-endedness of the missions can lead to frustration. Even taking one step two far or opening the wrong door can lead to mission failures, so you can spend hours on a single mission trying to figure out which technique will work. 
In essence you learn what each AI character does and then work around it - which isn't very realistic. Still, the game is quite fun to play, the graphics are very nice, and the missions are interested. Recommended for any fan of the stealth-fps genre.",Rewarding Strategy/Stealth FPS Gaming,1070755200,1
games,197886,5.0,3,False,"12 5, 2009",A3V6Z4RCDGRC44,B002BRYXRQ,Lisa Shea,"The polished successor to last year's hit ""Left 4 Dead"", Left 4 Dead 2 expands the basic gameplay of the previous game and adds several new features to the mix. The game provokes mixed feelings, on one hand coming out only a year after the original L4D (which didn't receive much downloadable content) and on the other hand being full of enough new content to justify a sequel. Like the first L4D, the game puts players in the shoes of four survivors of a zombie apocalypse. The survivors battle waves of the infected to move from one safe area to another. In L4D, the tools used by the survivors were primarily guns and explosive objects like pipe bombs and fuel cans. One of L4D's shortcomings was that it never really felt like ""survival"" - ammo could be found easily, and you were never really short on resources. L4D2 keeps the basic premise - pistols still have infinite ammo, occasional caches of ammo can be found - but generally increases the scarcity of resources to make for a tenser survival experience. In addition, the choice of weaponry is greatly increased from L4D: in addition to a wider range of guns (rather than the 7 offered in L4D, 3 of which were an improved version of the other 3) there are also melee weapons. These allow for close-range damage without needing to reload. There are also more tools and items to be found, like adrenaline that increases your action speed, ""boomer bile"" which causes the infected to fight amongst themselves, and ammo boxes that make your guns shoot burning or explosive ammunition. In addition, there are occasional ""special infected"" with unique abilities who will show up to attack. In L4D, these were the ""hunter"", who pounced on his prey, the ""smoker"", who strangled them with his long tongue, and the ""boomer"", who covered survivors with zombie-attracting bile. 
In addition, the ""tank"" (a physical powerhouse) and the ""witch"" (a super-powerful zombie who could be avoided if the survivors were careful) also made less-frequent appearances. In L4D2, these special infected are joined by the ""spitter"", who launches gobs of area-covering acid, the ""charger"", who rams into survivors and pummels them, and the ""jockey"", who jumps on a survivor and attempts to steer them into nearby hazards. The new special infected affect the balance of the game overall - the spitter forces survivors to keep moving instead of holing up in one spot and the jockey uses environmental hazards to his advantage. The charger is a little annoying, because even if you spot it early, unlike the other special infected it can take so many bullets that you usually end up getting hit anyways. There are five campaigns, all with much more unique layouts than the campaigns in L4D. ""Dead Center"" is a fight through a shopping mall, ""Dark Carnival"" takes place in an amusement park, ""Swamp Fever"" sees the survivors attempting to get through the bayou, ""Hard Rain"" involves a two-way trip to get gas during a major thunderstorm, and finally ""The Parish"" puts you on the streets of New Orleans attempting to reach the last helicopter out. The maps are all much more fleshed out than they are in L4D, and due to the expanded list of items and bonuses in the game there's much more potential benefit to exploring out-of-the-way locations like abandoned houses. There's also more choice in route, though the paths are always ultimately linear. The campaigns on the whole feel much more unique and thematic than their L4D counterparts, and there's a distinct flavor to each of them. There are more game modes than in L4D. 
In addition to the regular campaign (human players versus ai-controlled infected), survival (players hold out for as long as they can in one area), and the versus mode (player-controlled survivors versus player-controlled special infected), there's also realism mode (a campaign wherein many elements like glowing silhouettes are disabled, making it harder for players to find each other without good teamwork) and scavenge mode (where players try to collect gas cans and bring them back to a central location). The versus mode is much more fun than in L4D because of the new special infected, but the new game modes are sort of iffy. Overall, L4D2 is a great game - much improved over its predecessor - but it still feels like it's lacking a lot. The new melee weapons are all basically identical, so there's no real sense of scavenging or anything like that. Everything from a fire axe to a cricket bat is completely capable of tearing apart a zombie with one swing. The new guns are a little better - I liked the idea of adding a grenade launcher that has non-replenishable ammo - but they're so easy to find that there's still really no tension. Basically, though, the game is an action game, not a survival game. If you want to have fun playing cooperatively with your friends, L4D2 is probably the best game to do it with. Rating: 9/10.",expands the basic gameplay of the previous game,1259971200,1
books,121629,5.0,0,False,"12 5, 2000",A3V6Z4RCDGRC44,0007141882,Lisa Shea,"Dr. Seuss is a brilliant storyteller who got into writing childrens books hoping to give kids something fun to read. At the time, most children only had options like Dick and Jane ... nothing that captured the imagination. The Grinch certainly was a fresh, exciting change! A classic that many associate easily with the holiday season, the Grinch is a mean-hearted grouch that simply hates everything about caring, and giving, and sharing. He sets forth to destroy the holidays for the delightfully innocent Whos in Who-ville. His plans are brought to a crashing halt when he realizes that the Whos do not care about the presents and trapings he stole - they are joyfully happy because they have each other, and they have the celebration of the day. This knowledge converts the Grinch into a kinder, gentler creature. It's really a powerful message for just about anyone in this commercial age - that in the end it's not the boxes and shiny paper under the tree, it's the joy and happiness of being together and sharing this special season. Best of all, this message isn't given in an overbearing manner. It's done in an extremely lighthearted, open and joyous storytelling with GORGEOUS graphics and a deft touch. I loved this book as a child, and I love it even more as an adult, reading it to my own child. Grab yourself a copy or two, and give it out to friends. You'd be surprised how much meaning this simple book still holds.",A true classic for readers of all ages,975974400,0
games,45020,4.0,16,False,"12 4, 2003",A3V6Z4RCDGRC44,B00008XPJH,Lisa Shea,"The Lord of the Rings trilogy has been getting quite popular in the past few years. It was inevitable that the original story - of Bilbo Baggins finding the ring - would become a new computer game. Bilbo Baggins is enlisted by Gandalf to go along with a group of dwarves to explore and steal back a treasure from the dragon Smaug. As you go you gain skill with your sword and courage for your hobbit brain. The game is on the cutesy side and bears no resemblance to the movie hobbits. It's more based on the infamous cartoon version of the Hobbit by Arthur Rankin Jr, with round-bellied short creatures that refuse to talk of adventure. The game involves a lot of jumping from pillar to pillar and going on eternal quests fetching needles and hammers and nails. Yes, there's bashing involved too, mostly of the jump-and-slash, jump-and-slash, jump-and-slash kind. Don't get me wrong - I *love* Lord of the Rings and will eagerly buy any game based on the series. But I also love great gameplay. I would play Zelda for months and months, and the games based on the movies in the Lord of the Rings trilogy were superb. This one is on the cute side and is more about running through bushes and shrubs gathering up ""courage gems"" than any actual real thought. You could easily win this game in a few hours. Sure, you can then go back and find ""every last gem"" ... but really, running around every square inch of ground hoping mysterious gems pop out of the ground isn't my idea of fun. I'd recommend renting this one and playing it for a weekend. You'll easily win the game several times in that period. If you still find the game fun, buy yourself a copy and enjoy! But if you've tired of the incessant repetitive music, jumping platform areas and tediously long speeches by then, you'll be quite happy to trade the game back in and get something else. 
I recommend Return of the King - now there's a game I love playing over and over!","Rent first, buy only if you still enjoy it",1070496000,1
home,266450,5.0,0,False,"12 31, 2013",A3V6Z4RCDGRC44,B0001J05IC,Lisa Shea,"This review is for the Venta Airwasher Humidifier LW15 WHITE unit. We have used many, many humidifiers over the years. We have forced hot water heat and the house gets incredibly dry in the winter. We have parakeets, cats, and a plethora of plants, so it's important to us to keep the house at a decent humidity level. It's of course good for us humans as well :). So when we were offered a review copy of this Venta LW15 unit, we jumped at it. How much better could it be compared with all these other humidifiers we've tried? The Venta has two speeds - high and low. Even on high, the fan is not crazy loud. We can watch TV with the unit nearby. I'll note of course it's not SILENT. The quietest humidifier we have is an ultrasonic. But the ultrasonic can't handle the volume that this LW15 does. The LW15 works with a turning plastic wheel. That wheel turns, water fills in the gaps, and the water then comes out into the room. So there's no filter to change. Just a turning plastic wheel to clean. We were impressed that this seems to be one of the most intelligently built humidifiers we've used. The fan sucks dry air down and blows it out over the water wheel, vs the water-laden air going over the fan mechanism and slowly damaging it. So to reiterate, this unit doesn't need any filters. Because there's nothing interacting with the water like a filter, you only need to add a liquid additive every few weeks. That additive doesn't evaporate or get wicked away so you don't need to replace it with every refill. The unit has a detector for emptiness, which would seem like a no-brainer, but we have several units that do NOT know when they are empty and just keep running. So it gets credit for that. On the other hand, since this is a turning wheel apparatus, by the time it gets to no water, it's already been running fairly ineffectively for a while. 
So it'd be nice if it had a yellow light when it was getting low, so you could refill it then. The effectiveness seems to vary with amount of water in it. As the water levels goes down, it doesn't seem to humidify as well. Again, this has to do with the wheel design. For some strange reason, there's no humidity detection. The unit is either on or off. That's it. This baffles us. Having a unit that knows when to run and not to run is fairly basic. Also, in terms of ""forcing"" the air to become fairly humid, this unit can't do that. It's not blasting vapor into air. It's just presenting water to the air that the air CAN pick up if it's a bit dry. So in general we find that our room won't go higher than 50% because air won't pick up the water any more. But it works nicely to get a room into the 40-50% range. It has a nice, large capacity - 1.3 gallons. Also - and this is great!! - the filling bin sits flat in the sink. It's very nice for filling. Again you'd think all units should do this, but so many that we own have bizarre, art-deco style tanks which don't sit flat in the sink. It makes them a royal pain to fill. You can add fragrance oil into the water. We've tested this out and it does add a nice aroma to the room. This isn't something you can normally do with the filter style units. A final nice feature - when you refill it, it remembers the speed it was on before. Many units completely forget everything and you have to start again. Not a huge deal, but it's nice. Well recommended. This unit is fairly similar to the LW45, which we also reviewed. We received a review copy of this unit.",Great with some issues,1388448000,0
home,266451,5.0,3,False,"12 31, 2013",A3V6Z4RCDGRC44,B0001J05IC,Lisa Shea,"This review is for the Venta Airwasher Humidifier LW45 GREY unit. We have used many, many humidifiers over the years. We have forced hot water heat and the house gets incredibly dry in the winter. We have parakeets, cats, and a plethora of plants, so it's important to us to keep the house at a decent humidity level. It's of course good for us humans as well :). So when we were offered a review copy of this Venta LW45 unit, we jumped at it. How much better could it be compared with all these other humidifiers we've tried? The Venta has three speeds - high, medium, and low. Even on high, the fan is not crazy loud. We can watch TV with the unit nearby. I'll note of course it's not SILENT. The quietest humidifier we have is an ultrasonic. But the ultrasonic can't handle the volume that this LW45 does. The LW45 works with two plastic wheels. Those wheels turn, water fills in the gaps, and the water then comes out into the room. So there's no filter to change. Just a turning plastic wheel to clean. We were impressed that this seems to be one of the most intelligently built humidifiers we've used. The fan sucks dry air down and blows it out over the water wheel, vs the water-laden air going over the fan mechanism and slowly damaging it. So to reiterate, this unit doesn't need any filters. Because there's nothing interacting with the water like a filter, you only need to add a liquid additive every few weeks. That additive doesn't evaporate or get wicked away so you don't need to replace it with every refill. The unit has a detector for emptiness, which would seem like a no-brainer, but we have several units that do NOT know when they are empty and just keep running. So it gets credit for that. On the other hand, since this is a turning wheel apparatus, by the time it gets to no water, it's already been running fairly ineffectively for a while. 
So it'd be nice if it had a yellow light when it was getting low, so you could refill it then. The effectiveness seems to vary with amount of water in it. As the water levels goes down, it doesn't seem to humidify as well. Again, this has to do with the wheel design. For some strange reason, there's no humidity detection. The unit is either on or off. That's it. This baffles us. Having a unit that knows when to run and not to run is fairly basic. With the large tank, though, it can run at high all day long and not run out. Also, in terms of ""forcing"" the air to become fairly humid, this unit can't do that. It's not blasting vapor into air. It's just presenting water to the air that the air CAN pick up if it's a bit dry. So in general we find that our room won't go higher than 50% because air won't pick up the water any more. But it works nicely to get a room into the 40-50% range. It has a nice, large capacity - 3 gallons. Also - and this is great!! - the filling bin sits flat in the sink. It's very nice for filling. Again you'd think all units should do this, but so many that we own have bizarre, art-deco style tanks which don't sit flat in the sink. It makes them a royal pain to fill. You can add fragrance oil into the water. We've tested this out and it does add a nice aroma to the room. This isn't something you can normally do with the filter style units. A final nice feature - when you refill it, it remembers the speed it was on before. Many units completely forget everything and you have to start again. Not a huge deal, but it's nice. Well recommended. This unit is fairly similar to the LW15, which we also reviewed. We received a review copy of this unit.",Great with Some Issues,1388448000,1
games,6982,5.0,0,False,"12 31, 2010",A3V6Z4RCDGRC44,B00002NDRY,Lisa Shea,"Most games are hot when they release and then quickly become old and dated. Age of Empires II - The Age of Kings stands the test of time. This is a game that is fun even years after its release. You would think a game created in 1999 would seem blocky and old after only a year or two. Quite the contrary. The gameplay, immersion in detail, and strategic fun in Age of Empires II makes it into a game that I envision will be playable and fun for years to come. The game takes you through a variety of situations. You need to be an efficient hunter-gatherer. You have to know how to manage naval warfare. It's a great test of your ability to think outside the box, as you investigate and manage a new set of challenges with every pass. I adore history, so it was a treat to be in a scenario with William Wallace and Joan of Arc. You see the real life locations they had to fight through, and you save the day with your own activities. The locations span the globe, keeping things interesting. I realize that some people aren't interested in multiplayer and certainly there is enough here in the game to keep a solo player happy. However, those who do love multiplayer will be in for a treat. The game does an excellent job of keeping all sides engaged in the battle. I happen to love the multiplayer aspect here. Yes, the graphics aren't super high def and up to current standards. That is quite all right by me. The gameplay itself is well done, and really after a few minutes I don't even notice what the details of the various structures look like. I am too busy racing around the screen planning my defenses. As long as the on screen icons are easy to identify and work with, that's all that matters to me. Are there any downsides? I imagine gamers who are most focused on first person shooters, who want to aim rifles at heads and shoot, probably won't be as keen on the strategy based, long term planning involved here. 
That is quite fine. There are different types of games for different players. For strategy fans, Age of Empires II is quite a treat. We purchased Age of Empires II with our own funds.",A Must-Have for a Strategy Gaming Fan,1293753600,0
games,37643,5.0,0,False,"12 31, 2010",A3V6Z4RCDGRC44,B00006GEX2,Lisa Shea,"Age of Empires 1 and 2 were incredible games with great graphics and gameplay. Age of Mythology takes that engine, beefs up the graphics again and adds in populus-like god power! First, I have to admit that I am a huge Age of Empires fan. I asked everybody around me to play with me, and played it solo just about every chance I could get. I thought the graphics and gameplay and customization were just great. I'm also a populus player from way back, and have eagerly played each release of Populus and Black & White. I enjoy the god-games very much. So Age of Mythology was just about the perfect game that I could ask for! How did it meet my expectations? First, gameplay is just like in Age of Empires. These guys have made a number of games in the Age of Empires series and have received feedback on each set of changes they made. At this point the interface and training are WELL honed and very easy to use. The tutorial helps guide you quickly and easily through learning the system. The graphics are simply amazing. The gorgeous intro video is classic. The in-game graphics are great - just watch those waves lap up on the beach! It's a true pleasure to play. The sounds are well matched to each culture you play, with the characters speaking in the appropriate language for 'background chatter' and ambient noises coming in. It's not intended to be fully immersive like Myst 3 - this is a strategy game, not a RPG. But they do a good job with the sounds. The cultures you choose from are Egyptian, Greek, and Norse. In each culture you have appropriate buildings and troops, plus the gods to choose from. At each stage in development you choose a new additional god to pray to which gives you access to new powers. Some of the powers are pretty amazing! Multiplayer, as always, shines in this game. 
You can create maps with all the terrain types, devise your own battle situations, play on line against thousands of other Age of Empire enthusiasts. The gameplay that comes with the game is just the start - on line multiplay is where you really test out your strategic thinking. My complaint with the game is that it's too limited in the cultures and gods you get. They have an incredible engine that has been developed over the years, and it shines - but you only get 3 cultures. And those three cultures have very few gods. The 'cheat sheets' that come with the game make it seem like you have a lot of gods and a lot of combinations, but look more closely - it's just the same gods listed in three columns. Undoubtedly their plan is to get people hooked on Age of Mythology, and then slowly dole out new cultures and new gods over time, making us pay for each additional one. If they'd at least started us with five or six cultures that might not have been as bad, but three is sort of flimsy. Still, I'll be playing this for months, and when those expansions come out, I'll be there to buy them! We purchased Age of Mythology with our own funds from a gaming store.",Stellar Game,1293753600,0
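
The row order in the output above follows the raw `reviewTime` strings, which sort character by character rather than by date. The following plain-Python sketch, using date strings taken from the output, contrasts that order with a chronological one. The parsing format `"%m %d, %Y"` is an assumption based on how the strings look in this dataset.

```python
from datetime import datetime

# reviewTime strings as they appear in the output above
review_times = ["12 5, 2000", "12 31, 2013", "12 7, 2003",
                "12 4, 2003", "12 5, 2009", "12 31, 2010"]

# Descending sort on the raw strings (what sorting the string column does)
lexicographic = sorted(review_times, reverse=True)

# Descending sort on parsed dates (newest review first)
chronological = sorted(
    review_times,
    key=lambda s: datetime.strptime(s, "%m %d, %Y"),
    reverse=True,
)

print(lexicographic)
print(chronological)
```

In the notebook itself, sorting on `unixReviewTime` instead of `reviewTime` would give a chronological order at this stage, since the date conversion only happens in a later cell.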


In [13]:
#Create a sample of the full dataset for the EDA. Recommended fraction: 0.4.
df = df_complete.sample(fraction=0.4, seed=4) #sampled dataset, without the term frequency
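
Note that `fraction` is the probability with which each row is kept, so the sample holds roughly, not exactly, 40% of the rows. The plain-Python sketch below illustrates this kind of per-row sampling; `bernoulli_sample` is a hypothetical helper written for illustration and does not reproduce Spark's internal random number generation.

```python
import random

def bernoulli_sample(rows, fraction, seed):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]

rows = list(range(10_000))  # stand-in for review rows

sample_a = bernoulli_sample(rows, fraction=0.4, seed=4)
sample_b = bernoulli_sample(rows, fraction=0.4, seed=4)

print(len(sample_a) / len(rows))  # close to 0.4, but generally not exactly 0.4
print(sample_a == sample_b)       # the same seed gives the same sample in this sketch
```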

In [14]:
# Drop duplicates

df = df.dropDuplicates(['reviewerID', 'asin'])
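
Deduplicating on `(reviewerID, asin)` keeps one review per reviewer and item, even when the review texts differ. The output of Cmd 12 contains such a case: two reviews by the same reviewer for asin `B0001J05IC` (reviewID 266450 and 266451). The sketch below shows the effect in plain Python. It keeps the first row seen, whereas Spark does not guarantee which of the duplicate rows survives.

```python
# Three rows taken from the Cmd 12 output (only the relevant columns)
reviews = [
    {"reviewID": 266450, "reviewerID": "A3V6Z4RCDGRC44", "asin": "B0001J05IC"},
    {"reviewID": 266451, "reviewerID": "A3V6Z4RCDGRC44", "asin": "B0001J05IC"},
    {"reviewID": 6982, "reviewerID": "A3V6Z4RCDGRC44", "asin": "B00002NDRY"},
]

def drop_duplicates(rows, keys):
    """Keep one row per distinct combination of `keys` (here: the first seen)."""
    seen = set()
    kept = []
    for row in rows:
        key = tuple(row[k] for k in keys)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept

deduped = drop_duplicates(reviews, ["reviewerID", "asin"])
print([row["reviewID"] for row in deduped])
```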


In [15]:
# Convert Unix timestamp to readable date

from pyspark.sql.functions import from_unixtime, to_date

df = df.withColumn("reviewTime", to_date(from_unixtime(df.unixReviewTime))) \
       .drop("unixReviewTime")

# Convert the vote column to an integer type, then fill missing votes with 0

from pyspark.sql.types import *

df = df.withColumn("vote", df.vote.cast(IntegerType())) \
       .fillna(0, subset=["vote"])
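
As a quick check of the two conversions, the plain-Python sketch below applies the same logic to values from the Cmd 12 output. It assumes UTC; Spark's `from_unixtime` uses the session time zone, so dates can shift by one day under a different setting. The helper names are illustrative only.

```python
from datetime import datetime, timezone

def unix_to_date(unix_seconds):
    """Convert Unix seconds to a calendar date, interpreted in UTC."""
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc).date()

def vote_to_int(vote):
    """Treat a missing vote as 0, otherwise parse it as an integer."""
    return 0 if vote is None else int(vote)

print(unix_to_date(1070755200))  # 2003-12-07, matching reviewTime "12 7, 2003"
print(unix_to_date(1388448000))  # 2013-12-31, matching reviewTime "12 31, 2013"
print(vote_to_int(None), vote_to_int("10"))  # 0 10
```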

In [16]:
from pyspark.sql import functions as sf
df_joined = df.withColumn('joined_column', 
                    sf.concat(sf.col('summary'),sf.lit(' '), sf.col('reviewText')))

#Drop rows with a null joined column (concat returns null if either input is null)
df_joined = df_joined.filter(sf.col('joined_column').isNotNull())
#Sanity check: count the nulls remaining in the joined column (should be 0)
newNA = df_joined.filter(sf.col('joined_column').isNull()).count()
#print(newNA)
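
The null filter matters because Spark's `concat` returns null as soon as one of its inputs is null, so a review with a missing summary or missing text produces a null `joined_column`. The sketch below mimics that behavior in plain Python. `concat_or_none` is a hypothetical helper, and the second row is made up for illustration.

```python
def concat_or_none(*parts):
    """Return None if any part is None, otherwise the parts joined together."""
    if any(part is None for part in parts):
        return None
    return "".join(parts)

rows = [
    {"summary": "Stellar Game", "reviewText": "Age of Empires 1 and 2 were incredible games."},
    {"summary": None, "reviewText": "A review without a summary."},  # hypothetical row
]

joined = [concat_or_none(r["summary"], " ", r["reviewText"]) for r in rows]

# Equivalent of filtering on isNotNull(): rows with a missing part are dropped
kept = [text for text in joined if text is not None]
print(kept)
```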

In [17]:

#Load the NLTK English stopword list (downloaded in the first cell)
eng_stopwords = stopwords.words('english')

#Start SPARK NLP
spark = sparknlp.start()

#Make new Dataframe
df_preprocess=df_joined

#Create the pipeline stages: assemble document -> tokenize -> remove stop words -> normalize -> stem/lemmatize -> n-grams -> POS tags -> chunks -> output columns
document = DocumentAssembler()\
    .setInputCol("joined_column")\
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token") \
    .setSplitChars(['-']) \
    .setContextChars(['(', ')', '?', '!'])

remover = StopWordsCleaner() \
    .setInputCols(["token"])\
    .setOutputCol("nostopwords")\
    .setStopWords(eng_stopwords)

normalizer = Normalizer() \
     .setInputCols(['nostopwords']) \
     .setOutputCol('normalized') \
     .setLowercase(True)

normalizerCaps = Normalizer() \
     .setInputCols(['nostopwords']) \
     .setOutputCol('normalized_wCaps') \
     .setLowercase(False)

stemmer = Stemmer() \
    .setInputCols(["normalized"]) \
    .setOutputCol("stem")

lemmatizer = LemmatizerModel.pretrained(name="lemma_antbnc", lang="en") \
    .setInputCols(["normalized"]) \
    .setOutputCol("lemmat") 

ngram = NGramGenerator() \
            .setInputCols(["normalized"]) \
            .setOutputCol("ngrams") \
            .setN(2) \
            .setEnableCumulative(True)\
            .setDelimiter("_")
  
pos = PerceptronModel.pretrained('pos_anc') \
     .setInputCols(['document', 'normalized']) \
     .setOutputCol('pos')

allowed_tags = ['<JJ>+<NN>', '<NN>+<NN>'] #setting ngrams to meaningful sets for the Chunker function

chunker = Chunker() \
     .setInputCols(['document', 'pos']) \
     .setOutputCol('nmeaning') \
     .setRegexParsers(allowed_tags)

finisher = Finisher() \
     .setInputCols(['token','nostopwords','normalized','normalized_wCaps','stem','lemmat','ngrams', 'nmeaning']) #these columns are written to the df as finished_<name>
  
pipeline = Pipeline(
    stages = [
        document,
        tokenizer,
        remover,
        normalizer,
        normalizerCaps,
        stemmer,
        lemmatizer,
        ngram,
        pos,
        chunker,
        finisher
    ])

pipelineModel = pipeline.fit(df_preprocess)
df_preprocess = pipelineModel.transform(df_preprocess)
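
To make the `ngram` stage concrete, here is a plain-Python sketch of what a cumulative n-gram generator with `N=2` and delimiter `_` produces, assuming that "cumulative" means every n-gram length from 1 up to N is emitted. The function `cumulative_ngrams` is a hypothetical helper, not a Spark NLP API, and the ordering of Spark NLP's actual output may differ.

```python
from typing import List

def cumulative_ngrams(tokens: List[str], n: int = 2, delimiter: str = "_") -> List[str]:
    """Return all k-grams for k = 1..n, each joined with `delimiter`."""
    grams = []
    for k in range(1, n + 1):
        # Slide a window of width k over the token list
        for i in range(len(tokens) - k + 1):
            grams.append(delimiter.join(tokens[i:i + k]))
    return grams

# Illustrative tokens, as they might look after stop word removal and normalization
print(cumulative_ngrams(["pan", "easy", "clean"]))
```

Because unigrams and bigrams end up in the same `finished_ngrams` array, downstream counts on that column mix both lengths.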


In [18]:
import pyspark.sql.functions as f
from pyspark.sql.functions import length, when

#Count number of words
df_preprocess = df_preprocess.withColumn('wordCount', f.size(f.split(f.col('joined_column'), ' ')))
df_preprocess = df_preprocess.withColumn('wordCount_normalized', f.size(f.col('finished_normalized_wCaps')))

#Count number of char
df_preprocess = df_preprocess.withColumn('charCount', length('joined_column'))
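
A caveat on `wordCount`: splitting on a single space counts the empty strings produced by repeated or trailing spaces as words. The plain-Python sketch below illustrates the effect, assuming Spark's `split(col, ' ')` behaves like Python's `str.split(' ')` here. Both helper names and the sample string are made up for illustration.

```python
import re

def count_single_space(text: str) -> int:
    """Mirrors size(split(col, ' ')): empty strings between spaces are counted."""
    return len(text.split(" "))

def count_whitespace_runs(text: str) -> int:
    """Trims the text and splits on runs of whitespace, so extra spaces are ignored."""
    stripped = text.strip()
    return len(re.split(r"\s+", stripped)) if stripped else 0

sample = "Great  pan, easy to clean "  # double space and trailing space on purpose
print(count_single_space(sample), count_whitespace_runs(sample))
```

If the inflated counts matter, splitting on the pattern `\s+` after trimming would be one option on the Spark side as well.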

###Most popular N-Gram over Time

In [20]:
df_preprocess.cache()
df_word_time = df_preprocess.select(f.explode('finished_nmeaning'), 'reviewTime')

In [21]:
df_word_time2 = df_word_time.filter(df_word_time.col=="non-stick")
display(df_word_time2)

In [22]:
df_word_time3 = df_word_time.filter(df_word_time.col=="easy to clean")
display(df_word_time3)

col,reviewTime
easy to clean,2017-02-26
easy to clean,2015-09-09
easy to clean,2013-10-17
easy to clean,2017-11-02
easy to clean,2014-11-05
easy to clean,2011-05-20
easy to clean,2013-11-09
easy to clean,2011-10-23
easy to clean,2013-10-06
easy to clean,2013-06-12


In [23]:
df_word_time4 = df_word_time.filter(df_word_time.col=="easy to install")
display(df_word_time4)

col,reviewTime
easy to install,2015-10-27
easy to install,2015-10-27
easy to install,2014-07-22
easy to install,2013-12-29
easy to install,2015-12-16
easy to install,2015-12-16
easy to install,2016-10-07
easy to install,2016-08-25
easy to install,2016-08-25
easy to install,2015-10-28


In [24]:
df_word_time5 = df_word_time.filter(df_word_time.col=="console game")
display(df_word_time5)

col,reviewTime
console game,2007-12-06
console game,2007-01-14
console game,2014-09-14
console game,2011-12-20
console game,2009-08-11
console game,2008-04-15
console game,2013-04-29
console game,2006-12-28
console game,2009-12-06
console game,2003-06-24
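
To read a trend out of rows like these, the occurrences can be tallied per period. The sketch below counts the ten displayed "console game" rows by year in plain Python. It only covers the rows shown above, so the counts are not representative of the full dataset; on the full data the same grouping would be done in Spark.

```python
from collections import Counter
from datetime import date

# The ten reviewTime values displayed above for "console game"
review_dates = [
    "2007-12-06", "2007-01-14", "2014-09-14", "2011-12-20", "2009-08-11",
    "2008-04-15", "2013-04-29", "2006-12-28", "2009-12-06", "2003-06-24",
]

# Count how many of the displayed occurrences fall in each year
per_year = Counter(date.fromisoformat(d).year for d in review_dates)

for year in sorted(per_year):
    print(year, per_year[year])
```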


In [25]:
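# Enable Arrow-based columnar data transfers
# (on Spark 3.x this key is deprecated in favour of "spark.sql.execution.arrow.pyspark.enabled")
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

#Collect only the ID columns and the review text to the driver as a pandas df
pydf_to_pd = df_preprocess.select("reviewID", "reviewerID", "asin", "joined_column")
df_pd = pydf_to_pd.toPandas()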

#SENTIMENT ANALYSIS
from textblob import TextBlob

pol = lambda x: TextBlob(x).sentiment.polarity
sub = lambda x: TextBlob(x).sentiment.subjectivity

df_pd['reviewTextpolarity'] = df_pd['joined_column'].apply(pol)
df_pd['reviewTextsubjectivity'] = df_pd['joined_column'].apply(sub)
#df_small_pd['summarypolarity'] = df_small_pd['summary'].apply(pol)
#df_small_pd['summarysubjectivity'] = df_small_pd['summary'].apply(sub)

#convert pd df to pyspark df
df_pyspark = spark.createDataFrame(df_pd)

#join sentiment using the 3 IDs
#(joining on column names keeps a single copy of each ID; joined_column is dropped on the right side to avoid a duplicate column)
df_final = df_preprocess.join(df_pyspark.drop('joined_column'), on=['reviewID', 'reviewerID', 'asin'], how='left')

In [26]:

df_cleaned = df_final
df_cleaned.createOrReplaceTempView("df_cleaned")

In [27]:
#Dropping some non-required columns to reduce the memory requirement.
df_cleaned = df_cleaned.drop('finished_nostopwords','finished_stem', 'finished_lemmat')

In [28]:
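#Number of distinct words per category
#(explode the token arrays first so that words, not whole token lists, are counted)
from pyspark.sql.functions import countDistinct, explode
df_cleaned.select('type', explode('finished_normalized_wCaps').alias('word')) \
          .groupBy('type').agg(countDistinct('word')).show()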
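
Note that `countDistinct` applied directly to an array column such as `finished_normalized_wCaps` counts distinct token lists, whereas counting distinct words requires one word per row (for example via `explode`). The plain-Python sketch below shows the difference on a few made-up reviews; the data and variable names are illustrative only.

```python
# Hypothetical (category, tokens) pairs standing in for type and finished_normalized_wCaps
reviews = [
    ("games", ["fun", "game"]),
    ("games", ["fun", "game"]),
    ("games", ["great", "game"]),
    ("home", ["great", "pan"]),
]

distinct_lists = {}  # category -> set of whole token lists
distinct_words = {}  # category -> set of individual words

for category, tokens in reviews:
    distinct_lists.setdefault(category, set()).add(tuple(tokens))
    distinct_words.setdefault(category, set()).update(tokens)

for category in distinct_lists:
    print(category, len(distinct_lists[category]), len(distinct_words[category]))
```

The two numbers diverge as soon as reviews share words without being identical, which is the normal case for review text.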

In [29]:
#Statistics on the number of words (min, max ,avg) per review
df_cleaned.groupBy('type').max("wordCount").show()
df_cleaned.groupBy('type').min("wordCount").show()
df_cleaned.groupBy('type').avg("wordCount").show()

###Create Tables for Tableau

In [31]:
#data table which includes tokens
word_eda = spark.sql("Select type as Category, overall as Rating, vote as HelpfulVote, verified as Verified, label as Useful, reviewTime as ReviewTime, reviewTextsubjectivity as Subjectivity, finished_ngrams as Tokens_Normalized, reviewTextpolarity as Polarity from df_cleaned")

In [32]:
word_eda.write.format("parquet").saveAsTable("GMMA2021_new_york.df_eda_word_final")

In [33]:
#Simplified EDA table without Tokens
eda = spark.sql("Select type as Category, overall as Rating, vote as HelpfulVote, verified as Verified, label as Useful, reviewTime as ReviewTime, reviewTextpolarity as Polarity, reviewTextsubjectivity as Subjectivity, wordCount_normalized as WordCountNormalized, wordCount as WordCount from df_cleaned")


In [34]:
eda.write.format("parquet").saveAsTable("GMMA2021_new_york.df_eda_final")