# NLP in PySpark's MLlib Project

## Fake Job Posting Predictions

Indeed.com has just hired you to create a system that automatically flags suspicious job postings on its website. It has recently seen an influx of fake job postings that is negatively impacting its customer experience. Because of the high volume of job postings it receives every day, its employees do not have the capacity to check every posting, so they would like to prioritize which postings to review before deleting them.

#### Your task
Use the attached dataset with NLP to create an algorithm that automatically flags suspicious posts for review.

#### The data
This dataset contains 18K job descriptions out of which about 800 are fake. The data consists of both textual information and meta-information about the jobs.

**Data Source:** https://www.kaggle.com/shivamb/real-or-fake-fake-jobposting-prediction

#### Have fun!

In [1]:
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('Flag Suspicious Posts').getOrCreate()

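# Note: this counts the entries in the executor memory status map (one per executor, driver included), which the message below reports as cores.
cores = spark._jsc.sc().getExecutorMemoryStatus().keySet().size()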
print('You are working with', cores, 'core(s)')
spark

You are working with 1 core(s)


In [2]:
# !pip install spark-nlp
# import sparknlp
# from sparknlp.annotator import *
# from sparknlp.pretrained import *

# spark = sparknlp.start()

In [3]:
from pyspark.ml.feature import *
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.ml.classification import *
from pyspark.ml.evaluation import *
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml import Pipeline 

In [4]:
fake_jobs = spark.read.csv('Datasets/fake_job_postings.csv', inferSchema=True, header=True)

In [5]:
fake_jobs.limit(5).toPandas()

Unnamed: 0,job_id,title,location,department,salary_range,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent
0,1,Marketing Intern,"US, NY, New York",Marketing,,"We're Food52, and we've created a groundbreaki...","Food52, a fast-growing, James Beard Award-winn...",Experience with content management systems a m...,,0,1,0,Other,Internship,,,Marketing,0
1,2,Customer Service - Cloud Video Production,"NZ, , Auckland",Success,,"90 Seconds, the worlds Cloud Video Production ...",Organised - Focused - Vibrant - Awesome!Do you...,What we expect from you:Your key responsibilit...,What you will get from usThrough being part of...,0,1,0,Full-time,Not Applicable,,Marketing and Advertising,Customer Service,0
2,3,Commissioning Machinery Assistant (CMA),"US, IA, Wever",,,Valor Services provides Workforce Solutions th...,"Our client, located in Houston, is actively se...",Implement pre-commissioning and commissioning ...,,0,1,0,,,,,,0
3,4,Account Executive - Washington DC,"US, DC, Washington",Sales,,Our passion for improving quality of life thro...,THE COMPANY: ESRI – Environmental Systems Rese...,"EDUCATION: Bachelor’s or Master’s in GIS, busi...",Our culture is anything but corporate—we have ...,0,1,0,Full-time,Mid-Senior level,Bachelor's Degree,Computer Software,Sales,0
4,5,Bill Review Manager,"US, FL, Fort Worth",,,SpotSource Solutions LLC is a Global Human Cap...,JOB TITLE: Itemization Review ManagerLOCATION:...,QUALIFICATIONS:RN license in the State of Texa...,Full Benefits Offered,0,1,1,Full-time,Mid-Senior level,Bachelor's Degree,Hospital & Health Care,Health Care Provider,0


In [6]:
fake_jobs.printSchema()

root
 |-- job_id: integer (nullable = true)
 |-- title: string (nullable = true)
 |-- location: string (nullable = true)
 |-- department: string (nullable = true)
 |-- salary_range: string (nullable = true)
 |-- company_profile: string (nullable = true)
 |-- description: string (nullable = true)
 |-- requirements: string (nullable = true)
 |-- benefits: string (nullable = true)
 |-- telecommuting: string (nullable = true)
 |-- has_company_logo: string (nullable = true)
 |-- has_questions: string (nullable = true)
 |-- employment_type: string (nullable = true)
 |-- required_experience: string (nullable = true)
 |-- required_education: string (nullable = true)
 |-- industry: string (nullable = true)
 |-- function: string (nullable = true)
 |-- fraudulent: string (nullable = true)



In [7]:
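def null_value_calc(df):
    """Return (column, null count, null percent) for every column that contains nulls."""
    null_columns_counts = []
    num_rows = df.count()
    for k in df.columns:
        # One Spark job per column: count the rows where this column is null.
        null_rows = df.where(col(k).isNull()).count()
        if null_rows > 0:
            null_columns_counts.append((k, null_rows, (null_rows / num_rows) * 100))
    return null_columns_counts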

null_columns_calc_list = sorted(null_value_calc(fake_jobs), reverse=True, key=lambda x: x[-1])

spark.createDataFrame(null_columns_calc_list, ['Column_Name', 'Null_Values_Count','Null_Value_Percent']). \
withColumn('Null_Value_Percent', round('Null_Value_Percent', 3)).show()

+-------------------+-----------------+------------------+
|        Column_Name|Null_Values_Count|Null_Value_Percent|
+-------------------+-----------------+------------------+
|       salary_range|            15011|            83.954|
|         department|            11547|            64.581|
| required_education|             7748|            43.333|
|           benefits|             6966|             38.96|
|required_experience|             6723|            37.601|
|           function|             6317|             35.33|
|           industry|             4831|            27.019|
|    company_profile|             3308|            18.501|
|    employment_type|             3292|            18.412|
|       requirements|             2573|             14.39|
|           location|              346|             1.935|
|         fraudulent|              176|             0.984|
|      telecommuting|               89|             0.498|
|      has_questions|               30|             0.16
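
`null_value_calc` runs one count per column, which means one pass over the data for each of the columns. The same report can be built in a single pass. The following is a minimal plain-Python sketch of that idea on made-up rows (the function name `null_percentages` and the sample data are assumptions for illustration, not part of the notebook). In Spark, the equivalent would be a single aggregation that combines `count`, `when`, and `isNull` over all columns at once.

```python
def null_percentages(rows, columns):
    """Count missing values for all columns in one pass over the rows."""
    counts = {c: 0 for c in columns}
    total = 0
    for row in rows:
        total += 1
        for c in columns:
            if row.get(c) is None:
                counts[c] += 1
    if total == 0:
        return []
    # Keep only columns that have nulls, highest percentage first.
    report = [(c, n, round(100 * n / total, 3)) for c, n in counts.items() if n > 0]
    return sorted(report, key=lambda item: item[-1], reverse=True)


# Hypothetical sample rows standing in for the job postings.
sample_rows = [
    {'title': 'Intern', 'salary_range': None, 'department': 'Marketing'},
    {'title': 'Manager', 'salary_range': None, 'department': None},
    {'title': 'Analyst', 'salary_range': '50000-60000', 'department': None},
    {'title': 'Clerk', 'salary_range': None, 'department': 'Sales'},
]
report = null_percentages(sample_rows, ['title', 'salary_range', 'department'])
for name, n_null, pct in report:
    print(name, n_null, pct)
```

Columns without nulls (`title` here) are left out of the report, as in the original function.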

In [8]:
fake_jobs.groupBy('fraudulent').count().orderBy('count', ascending=False).show()

+--------------------+-----+
|          fraudulent|count|
+--------------------+-----+
|                   0|16080|
|                   1|  886|
|                null|  176|
|           Full-time|   73|
|Hospital & Health...|   55|
|   Bachelor's Degree|   53|
|         Engineering|   26|
| perform quality ...|   17|
|         Unspecified|   15|
|    Mid-Senior level|   15|
|           Associate|   14|
|               Sales|   14|
|Information Techn...|   13|
|           Marketing|   13|
| passionate about...|   13|
|            Internet|   12|
|   Computer Software|   12|
|      Not Applicable|   11|
|We offer an excel...|   11|
| además con el fi...|   10|
+--------------------+-----+
only showing top 20 rows
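
The stray values in `fraudulent` (employment types, industries, sentence fragments) suggest that some records were split or shifted during parsing, which typically happens when quoted text fields contain embedded newlines or commas. The sketch below uses Python's standard `csv` module on a made-up two-record sample to show the effect; it is not the notebook's Spark reader. Spark's CSV reader has `multiLine`, `quote`, and `escape` options for this situation, but whether they recover every affected row in this dataset is an assumption that would need to be checked.

```python
import csv
import io

# Hypothetical sample: the description of the first record contains a comma and a newline.
raw = 'job_id,description,fraudulent\n1,"Great role,\nfull-time",0\n2,"Plain text",1\n'

# Naive approach: treat every physical line as one record and split on commas.
naive_rows = [line.split(',') for line in raw.strip().split('\n')[1:]]

# Quote-aware approach: csv.reader keeps the quoted newline inside a single field.
parsed_rows = list(csv.reader(io.StringIO(raw)))[1:]

naive_labels = [row[-1] for row in naive_rows]
parsed_labels = [row[-1] for row in parsed_rows]
print(len(naive_rows), naive_labels)
print(len(parsed_rows), parsed_labels)
```

The naive split produces three "records" with a shifted last column, while the quote-aware parse returns the two real records with clean labels.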



In [9]:
fake_jobs = fake_jobs.filter('fraudulent IN(0, 1)')

In [10]:
fake_jobs.groupBy('fraudulent').count().orderBy('count', ascending=False).show()

+----------+-----+
|fraudulent|count|
+----------+-----+
|         0|16080|
|         1|  886|
+----------+-----+
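
The counts above show a strongly imbalanced target. As a quick worked calculation (plain Python, using only the two counts from the output), roughly 5% of the remaining postings are fake, so a model that always predicts "real" would already be about 95% accurate while catching no fake postings at all. This is why accuracy alone is a weak yardstick for this task.

```python
real_count = 16080  # postings labelled 0 in the output above
fake_count = 886    # postings labelled 1 in the output above

total = real_count + fake_count
fake_share = fake_count / total          # fraction of postings that are fake
baseline_accuracy = real_count / total   # accuracy of always predicting "real"
imbalance_ratio = real_count / fake_count

print(f'total postings:     {total}')
print(f'fake share:         {fake_share:.4f}')
print(f'baseline accuracy:  {baseline_accuracy:.4f}')
print(f'real-to-fake ratio: {imbalance_ratio:.1f}')
```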



In [11]:
fake_jobs = fake_jobs.drop('job_id', 'salary_range')

In [12]:
len(fake_jobs.columns)

16

In [13]:
orig_len = fake_jobs.count()
drop_len = fake_jobs.dropna(thresh=10).count()
print('Total rows with fewer than 10 non-null values:', orig_len-drop_len)
print('Share of rows with fewer than 10 non-null values:', (orig_len-drop_len)/orig_len)

Total rows with fewer than 10 non-null values: 2187
Share of rows with fewer than 10 non-null values: 0.12890486856065073
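
`dropna(thresh=10)` keeps a row only when at least 10 of its values are non-null. With the 16 remaining columns, that removes rows that have 7 or more nulls. A minimal plain-Python sketch of the rule (the helper name `keeps_row` and the sample rows are illustrative assumptions):

```python
def keeps_row(row, thresh):
    """Mimic the thresh rule: keep the row if it has at least `thresh` non-null values."""
    non_null = sum(value is not None for value in row)
    return non_null >= thresh


n_columns = 16  # column count after dropping job_id and salary_range
six_nulls = [None] * 6 + ['x'] * (n_columns - 6)    # 10 non-null values
seven_nulls = [None] * 7 + ['x'] * (n_columns - 7)  # 9 non-null values

print(keeps_row(six_nulls, thresh=10))
print(keeps_row(seven_nulls, thresh=10))
```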


In [14]:
fake_jobs = fake_jobs.dropna(thresh=10)
fake_jobs = fake_jobs.na.fill(' ')

In [15]:
fake_jobs.select([count(when(isnan(c), c)).alias(c) for c in fake_jobs.columns]).limit(5).toPandas()

Unnamed: 0,title,location,department,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [16]:
import pandas as pd

pd.set_option('display.max_colwidth', None)

fake_jobs = fake_jobs.select(concat_ws(' ', fake_jobs.title, fake_jobs.location, fake_jobs.department, fake_jobs.company_profile, 
                          fake_jobs.description, fake_jobs.requirements, fake_jobs.benefits, fake_jobs.employment_type,
                          fake_jobs.required_experience, fake_jobs.required_education, fake_jobs.industry, fake_jobs.function). \
                alias('text'), 'fraudulent')
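
`concat_ws` joins the columns with the separator and skips null values. Because the nulls were replaced with a single space earlier, empty fields here contribute extra spaces instead of being skipped. A small plain-Python sketch of both cases (the helper `concat_ws_sketch` and the sample values are assumptions for illustration):

```python
def concat_ws_sketch(sep, *values):
    """Join values with a separator, leaving out None values."""
    return sep.join(v for v in values if v is not None)


# A null field is skipped entirely.
print(repr(concat_ws_sketch(' ', 'Marketing Intern', None, 'Other')))

# A field filled with ' ' is kept, which leaves a run of spaces in the text.
filled = concat_ws_sketch(' ', 'Marketing Intern', ' ', 'Other')
print(repr(filled))
```

The extra spaces are harmless as long as a later cleaning step collapses whitespace runs.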

In [17]:
fake_jobs.limit(3).toPandas()

Unnamed: 0,text,fraudulent
0,"Marketing Intern US, NY, New York Marketing We're Food52, and we've created a groundbreaking and award-winning cooking site. We support, connect, and celebrate home cooks, and give them everything they need in one place.We have a top editorial, business, and engineering team. We're focused on using technology to find new and better ways to connect people around their specific food interests, and to offer them superb, highly curated information about food and cooking. We attract the most talented home cooks and contributors in the country; we also publish well-known professionals like Mario Batali, Gwyneth Paltrow, and Danny Meyer. And we have partnerships with Whole Foods Market and Random House.Food52 has been named the best food website by the James Beard Foundation and IACP, and has been featured in the New York Times, NPR, Pando Daily, TechCrunch, and on the Today Show.We're located in Chelsea, in New York City. Food52, a fast-growing, James Beard Award-winning online food community and crowd-sourced and curated recipe hub, is currently interviewing full- and part-time unpaid interns to work in a small team of editors, executives, and developers in its New York City headquarters.Reproducing and/or repackaging existing Food52 content for a number of partner sites, such as Huffington Post, Yahoo, Buzzfeed, and more in their various content management systemsResearching blogs and websites for the Provisions by Food52 Affiliate ProgramAssisting in day-to-day affiliate program support, such as screening affiliates and assisting in any affiliate inquiriesSupporting with PR &amp; Events when neededHelping with office administrative work, such as filing, mailing, and preparing for meetingsWorking with developers to document bugs and suggest improvements to the siteSupporting the marketing and executive staff Experience with content management systems a major plus (any blogging counts!)Familiar with the Food52 editorial voice and aestheticLoves food, appreciates the 
importance of home cooking and cooking with the seasonsMeticulous editor, perfectionist, obsessive attention to detail, maddened by typos and broken links, delighted by finding and fixing themCheerful under pressureExcellent communication skillsA+ multi-tasker and juggler of responsibilities big and smallInterested in and engaged with social media like Twitter, Facebook, and PinterestLoves problem-solving and collaborating to drive Food52 forwardThinks big picture but pitches in on the nitty gritty of running a small company (dishes, shopping, administrative support)Comfortable with the realities of working for a startup: being on call on evenings and weekends, and working long hours Other Internship Marketing",0
1,"Customer Service - Cloud Video Production NZ, , Auckland Success 90 Seconds, the worlds Cloud Video Production Service.90 Seconds is the worlds Cloud Video Production Service enabling brands and agencies to get high quality online video content shot and produced anywhere in the world. 90 Seconds makes video production fast, affordable, and all managed seamlessly in the cloud from purchase to publish. http://90#URL_fbe6559afac620a3cd2c22281f7b8d0eef56a73e3d9a311e2f1ca13d081dd630#90 Seconds removes the hassle, cost, risk and speed issues of working with regular video production companies by managing every aspect of video projects in a beautiful online experience. With a growing global network of over 2,000 rated video professionals in over 50 countries managed by dedicated production success teams in 5 countries, 90 Seconds provides a 100% success guarantee.90 Seconds has produced almost 4,000 videos in over 30 Countries for over 500 Global brands including some of the worlds largest including Paypal, L’Oreal, Sony and Barclays and has offices in Auckland, London, Sydney, Tokyo and Singapore.http://90#URL_fbe6559afac620a3cd2c22281f7b8d0eef56a73e3d9a311e2f1ca13d081dd630# | http://90#URL_e2ad0bde3f09a0913a486abdbb1e6ac373bb3310f64b1fbcf550049bcba4a17b# | http://90#URL_8c5dd1806f97ab90876d9daebeb430f682dbc87e2f01549b47e96c7bff2ea17e# Organised - Focused - Vibrant - Awesome!Do you have a passion for customer service? Slick typing skills? Maybe Account Management? ...And think administration is cooler than a polar bear on a jetski? Then we need to hear you! We are the Cloud Video Production Service and opperating on a glodal level. Yeah, it's pretty cool. Serious about delivering a world class product and excellent customer service.Our rapidly expanding business is looking for a talented Project Manager to manage the successful delivery of video projects, manage client communications and drive the production process. 
Work with some of the coolest brands on the planet and learn from a global team that are representing NZ is a huge way!We are entering the next growth stage of our business and growing quickly internationally. Therefore, the position is bursting with opportunity for the right person entering the business at the right time. 90 Seconds, the worlds Cloud Video Production Service - http://90#URL_fbe6559afac620a3cd2c22281f7b8d0eef56a73e3d9a311e2f1ca13d081dd630#90 Seconds is the worlds Cloud Video Production Service enabling brands and agencies to get high quality online video content shot and produced anywhere in the world. Fast, affordable, and all managed seamlessly in the cloud from purchase to publish. 90 Seconds removes the hassle, cost, risk and speed issues of working with regular video production companies by managing every aspect of video projects in a beautiful online experience. With a growing network of over 2,000 rated video professionals in over 50 countries and dedicated production success teams in 5 countries guaranteeing video project success 100%. It's as easy as commissioning a quick google adwords campaign.90 Seconds has produced almost 4,000 videos in over 30 Countries for over 500 Global brands including some of the worlds largest including Paypal, L'oreal, Sony and Barclays and has offices in Auckland, London, Sydney, Tokyo &amp; Singapore.Our Auckland office is based right in the heart of the Wynyard Quarter Innovation Precinct - GridAKL! What we expect from you:Your key responsibility will be to communicate with the client, 90 Seconds team and freelance community throughout the video production process including, shoot planning, securing freelance talent, managing workflow and the online production management system. 
The aim is to manage each video project effectively so that we produce great videos that our clients love.Key attributesClient focused - excellent customer service and communication skillsOnline - oustanding computer knowledge and experience using online software and project management toolsOrganised - manage workload and able to multi-task100% attention to detailMotivated - self-starter with a passion for doing excellent work and achieving great resultsAdaptable - show initiative and think on your feet as this is a constantly evolving atmosphereFlexible - fast turnaround work and after hours availabilityEasy going &amp; upbeat - dosen't get bogged down and loves the challengeSense of Humour - have a laugh and know that working in a startup takes guts!Ability to deliver - including meeting project deadlines and budgetAttitude is more important than experience at 90 Seconds, however previous experience in customer service and/or project management is beneficialPlease view our platform / website at #URL_395a8683a907ce95f49a12fb240e6e47ad8d5a4f96d07ebbd869c4dd4dea1826# and get a clear understand about what we do before reaching out. What you will get from usThrough being part of the 90 Seconds team you will gain:experience working on projects located around the world with an international brandexperience working with a variety of clients and on a large range of projectsopportunity to drive and grow production function and teama positive working environment with a great teamPay$40,000-$55,000Applying for this role with a VIDEOBeing a video business, we understand that one of the quickest ways that we can assess your suitability for this role, and one of the quickest ways that you can apply for it, is for you to submit a 60-90 second long video telling us about yourself, your experience and why you think you would be perfect for the role. It’s not about being a filmmaker or making a really creative video. A simple video filmed with a smart phone or web cam will be fine. 
Please also include where you are based and when you can start.You can upload the video onto YouTube or Vimeo (or similar) as a Draft or Live link.APPLICATIONS DUE by 5pm on Wednesday 18th July 2014 - Once you have a video ready, apply for this role via the following link together with a cover letter and your CV. After we have watched your video and get an idea of your suitability for the role, we will email the shortlisted candidates Full-time Not Applicable Marketing and Advertising Customer Service",0
2,"Account Executive - Washington DC US, DC, Washington Sales Our passion for improving quality of life through geography is at the heart of everything we do. Esri’s geographic information system (GIS) technology inspires and enables governments, universities and businesses worldwide to save money, lives and our environment through a deeper understanding of the changing world around them.Carefully managed growth and zero debt give Esri stability that is uncommon in today's volatile business world. Privately held, we offer exceptional benefits, competitive salaries, 401(k) and profit-sharing programs, opportunities for personal and professional growth, and much more. THE COMPANY: ESRI – Environmental Systems Research InstituteOur passion for improving quality of life through geography is at the heart of everything we do. Esri’s geographic information system (GIS) technology inspires and enables governments, universities and businesses worldwide to save money, lives and our environment through a deeper understanding of the changing world around them.Carefully managed growth and zero debt give Esri stability that is uncommon in today's volatile business world. Privately held, we offer exceptional benefits, competitive salaries, 401(k) and profit-sharing programs, opportunities for personal and professional growth, and much more.THE OPPORTUNITY: Account ExecutiveAs a member of the Sales Division, you will work collaboratively with an account team in order to sell and promote adoption of Esri’s ArcGIS platform within an organization. As part of an account team, you will be responsible for facilitating the development and execution of a set of strategies for a defined portfolio of accounts. When executing these strategies you will utilize your experience in enterprise sales to help customers leverage geospatial information and technology to achieve their business goals. 
Specifically…Prospect and develop opportunities to partner with key stakeholders to envision, develop, and implement a location strategy for their organizationClearly articulate the strength and value proposition of the ArcGIS platformDevelop and maintain a healthy pipeline of opportunities for business growthDemonstrate a thoughtful understanding of insightful industry knowledge and how GIS applies to initiatives, trends, and triggersUnderstand the key business drivers within an organization and identify key business stakeholdersUnderstand your customers’ budgeting and acquisition processesSuccessfully execute the account management process including account prioritization, account resourcing, and account planningSuccessfully execute the sales process for all opportunitiesLeverage and lead an account team consisting of sales and other cross-divisional resources to define and execute an account strategyEffectively utilize and leverage the CRM to manage opportunities and drive the buying processPursue professional and personal development to ensure competitive knowledge of the real estate industryLeverage social media to successfully prospect and build a professional networkParticipate in trade shows, workshops, and seminars (as required)Support visual story telling through effective whiteboard sessionsBe resourceful and takes initiative to resolve issues EDUCATION: Bachelor’s or Master’s in GIS, business administration, or a related field, or equivalent work experience, depending on position levelEXPERIENCE: 5+ years of enterprise sales experience providing platform solutions to businessesDemonstrated experience in managing the sales cycle including prospecting, proposing, and closingAbility to adapt to new technology trends and translate them into solutions that address customer needsDemonstrated experience with strong partnerships and advocacy with customersExcellent presentation, white boarding, and negotiation skills including good listening, probing, and 
qualification abilitiesExperience executing insight selling methodologiesDemonstrated understanding and mitigation of competitive threatsExcellent written and verbal communication and interpersonal skillsAbility to manage and prioritize your activitiesDemonstrated experience to lead executive engagements to provide services and sell to the real estate industryKnowledge of the real estate industry fiscal year, budgeting, and procurement cycleHighly motivated team player with a mature, positive attitude and passion to meet the challenges and opportunities of a businessAbility to travel domestically and/or internationally up to 50%General knowledge of spatial analysis and problem solvingResults oriented; ability to write and craft smart, attainable, realistic, time-driven goals with clear lead indicators Our culture is anything but corporate—we have a collaborative, creative environment; phone directories organized by first name; a relaxed dress code; and open-door policies.A Place to ThrivePassionate people who strive to make a differenceCasual dress codeFlexible work schedulesSupport for continuing educationCollege-Like CampusA network of buildings amid lush landscaping and numerous outdoor patio areasOn-site café including a Starbucks coffee bar and lounge areaFitness center available 24/7Comprehensive reference library and GIS bibliographyState-of-the-art conference center to host staff and guest speakers Green InitiativesSolar rooftop panels reduce carbon emissionsElectric vehicles provide on-campus transportationHundreds of trees reduce the cost of cooling buildings Full-time Mid-Senior level Bachelor's Degree Computer Software Sales",0


In [18]:
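# Replace every run of non-letter characters (digits, punctuation, whitespace) with one space.
fake_jobs = fake_jobs.withColumn('text', regexp_replace(col('text'), '[^A-Za-z]+', ' '))

# Collapse any remaining whitespace runs and lowercase the text (raw string avoids an invalid escape).
fake_jobs = fake_jobs.withColumn('text', lower(regexp_replace(col('text'), r'\s+', ' ')))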
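
To see what these two substitutions do to a posting, here is a plain-Python sketch that applies the same patterns with the standard `re` module (the function name `clean_text` and the sample sentence are made up for illustration). Digits are removed along with punctuation, so a name like "Food52" becomes "food" and salary figures disappear entirely.

```python
import re


def clean_text(text):
    """Apply the same two substitutions as the Spark cell, then lowercase."""
    text = re.sub(r'[^A-Za-z]+', ' ', text)  # non-letter runs become one space
    text = re.sub(r'\s+', ' ', text)         # collapse whitespace runs
    return text.lower()


sample = "We're Food52, hiring 24/7!  Pay: $40,000-$55,000"
print(repr(clean_text(sample)))
```

Since the first pattern already replaces whole runs of non-letters (whitespace included) with a single space, the second substitution has nothing left to collapse here; it acts as a safeguard.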

In [19]:
regex_tokenizer = RegexTokenizer(inputCol='text', outputCol='words', pattern='\\W')

remover = StopWordsRemover(inputCol=regex_tokenizer.getOutputCol(), outputCol='filtered')

indexer = StringIndexer(inputCol='fraudulent', outputCol='label')


pipeline = Pipeline(stages=[regex_tokenizer, remover, indexer])
data_prep_pl = pipeline.fit(fake_jobs)

feature_data = data_prep_pl.transform(fake_jobs)
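
The three stages can be sketched in plain Python to make their behaviour concrete. This is an illustration under stated assumptions, not Spark's implementation: the stop-word set below is a tiny made-up list (Spark ships its own default English list), and the label mapping assumes the indexer's default ordering, where the most frequent label receives index 0.0.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; NOT the default list used by StopWordsRemover.
STOP_WORDS = {'we', 'and', 'a', 'the', 'to', 'in'}


def tokenize(text):
    """Split on non-word characters, lowercase, and drop empty tokens."""
    return [token for token in re.split(r'\W', text.lower()) if token]


def remove_stop_words(tokens):
    """Keep only tokens that are not in the stop-word list."""
    return [token for token in tokens if token not in STOP_WORDS]


def index_labels(labels):
    """Map each label to a float index, most frequent label first."""
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


words = tokenize('we re food and we ve created a groundbreaking site')
filtered = remove_stop_words(words)
mapping = index_labels(['0', '0', '0', '1'])
print(words)
print(filtered)
print(mapping)
```

With far more real postings than fake ones, the string `'0'` maps to label `0.0` and `'1'` to `1.0`, so the numeric label keeps the original meaning.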

In [20]:
# import sparknlp
# from sparknlp.base import *
# from sparknlp.annotator import *
# from pyspark.ml import Pipeline 

# documentAssembler = DocumentAssembler().setInputCol('text').setOutputCol('document')

# sentenceDetector = SentenceDetector().setInputCols(['document']).setOutputCol('sentence')

# regexTokenizer = Tokenizer().setInputCols(['sentence']).setOutputCol('tokens')

# stop_words = StopWordsCleaner.pretrained('stopwords_iso','en').setInputCols(['tokens']).setOutputCol('cleanTokens')

# lemma = LemmatizerModel.pretrained('lemma_lines', 'en').setInputCols(['cleanTokens']).setOutputCol('lemma')



# pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, regexTokenizer, stop_words, lemma])

# feature_data = pipeline.fit(fake_jobs).transform(fake_jobs)

In [21]:
# pipeline = PretrainedPipeline('spellcheck_dl_pipeline', lang = 'en')

# def spell_check(x):
#     return ' '.join(pipeline.annotate(x)[0]['checked'])

In [22]:
feature_data.limit(2).toPandas()

Unnamed: 0,text,fraudulent,words,filtered,label
0,marketing intern us ny new york marketing we re food and we ve created a groundbreaking and award winning cooking site we support connect and celebrate home cooks and give them everything they need in one place we have a top editorial business and engineering team we re focused on using technology to find new and better ways to connect people around their specific food interests and to offer them superb highly curated information about food and cooking we attract the most talented home cooks and contributors in the country we also publish well known professionals like mario batali gwyneth paltrow and danny meyer and we have partnerships with whole foods market and random house food has been named the best food website by the james beard foundation and iacp and has been featured in the new york times npr pando daily techcrunch and on the today show we re located in chelsea in new york city food a fast growing james beard award winning online food community and crowd sourced and curated recipe hub is currently interviewing full and part time unpaid interns to work in a small team of editors executives and developers in its new york city headquarters reproducing and or repackaging existing food content for a number of partner sites such as huffington post yahoo buzzfeed and more in their various content management systemsresearching blogs and websites for the provisions by food affiliate programassisting in day to day affiliate program support such as screening affiliates and assisting in any affiliate inquiriessupporting with pr amp events when neededhelping with office administrative work such as filing mailing and preparing for meetingsworking with developers to document bugs and suggest improvements to the sitesupporting the marketing and executive staff experience with content management systems a major plus any blogging counts familiar with the food editorial voice and aestheticloves food appreciates the importance of home cooking and cooking with the 
seasonsmeticulous editor perfectionist obsessive attention to detail maddened by typos and broken links delighted by finding and fixing themcheerful under pressureexcellent communication skillsa multi tasker and juggler of responsibilities big and smallinterested in and engaged with social media like twitter facebook and pinterestloves problem solving and collaborating to drive food forwardthinks big picture but pitches in on the nitty gritty of running a small company dishes shopping administrative support comfortable with the realities of working for a startup being on call on evenings and weekends and working long hours other internship marketing,0,"[marketing, intern, us, ny, new, york, marketing, we, re, food, and, we, ve, created, a, groundbreaking, and, award, winning, cooking, site, we, support, connect, and, celebrate, home, cooks, and, give, them, everything, they, need, in, one, place, we, have, a, top, editorial, business, and, engineering, team, we, re, focused, on, using, technology, to, find, new, and, better, ways, to, connect, people, around, their, specific, food, interests, and, to, offer, them, superb, highly, curated, information, about, food, and, cooking, we, attract, the, most, talented, home, cooks, and, contributors, in, the, country, we, also, publish, well, known, professionals, like, mario, batali, gwyneth, ...]","[marketing, intern, us, ny, new, york, marketing, re, food, ve, created, groundbreaking, award, winning, cooking, site, support, connect, celebrate, home, cooks, give, everything, need, one, place, top, editorial, business, engineering, team, re, focused, using, technology, find, new, better, ways, connect, people, around, specific, food, interests, offer, superb, highly, curated, information, food, cooking, attract, talented, home, cooks, contributors, country, also, publish, well, known, professionals, like, mario, batali, gwyneth, paltrow, danny, meyer, partnerships, whole, foods, market, random, house, food, named, best, 
food, website, james, beard, foundation, iacp, featured, new, york, times, npr, pando, daily, techcrunch, today, show, re, located, chelsea, new, york, ...]",0.0
1,customer service cloud video production nz auckland success seconds the worlds cloud video production service seconds is the worlds cloud video production service enabling brands and agencies to get high quality online video content shot and produced anywhere in the world seconds makes video production fast affordable and all managed seamlessly in the cloud from purchase to publish http url fbe afac a cd c f b d eef a e d a e f ca d dd seconds removes the hassle cost risk and speed issues of working with regular video production companies by managing every aspect of video projects in a beautiful online experience with a growing global network of over rated video professionals in over countries managed by dedicated production success teams in countries seconds provides a success guarantee seconds has produced almost videos in over countries for over global brands including some of the worlds largest including paypal l oreal sony and barclays and has offices in auckland london sydney tokyo and singapore http url fbe afac a cd c f b d eef a e d a e f ca d dd http url e ad bde f a a abdbb e ac bb f b fbcf bcba a b http url c dd f ab d daebeb f dbc e f b e c bff ea e organised focused vibrant awesome do you have a passion for customer service slick typing skills maybe account management and think administration is cooler than a polar bear on a jetski then we need to hear you we are the cloud video production service and opperating on a glodal level yeah it s pretty cool serious about delivering a world class product and excellent customer service our rapidly expanding business is looking for a talented project manager to manage the successful delivery of video projects manage client communications and drive the production process work with some of the coolest brands on the planet and learn from a global team that are representing nz is a huge way we are entering the next growth stage of our business and growing quickly internationally therefore the position is 
bursting with opportunity for the right person entering the business at the right time seconds the worlds cloud video production service http url fbe afac a cd c f b d eef a e d a e f ca d dd seconds is the worlds cloud video production service enabling brands and agencies to get high quality online video content shot and produced anywhere in the world fast affordable and all managed seamlessly in the cloud from purchase to publish seconds removes the hassle cost risk and speed issues of working with regular video production companies by managing every aspect of video projects in a beautiful online experience with a growing network of over rated video professionals in over countries and dedicated production success teams in countries guaranteeing video project success it s as easy as commissioning a quick google adwords campaign seconds has produced almost videos in over countries for over global brands including some of the worlds largest including paypal l oreal sony and barclays and has offices in auckland london sydney tokyo amp singapore our auckland office is based right in the heart of the wynyard quarter innovation precinct gridakl what we expect from you your key responsibility will be to communicate with the client seconds team and freelance community throughout the video production process including shoot planning securing freelance talent managing workflow and the online production management system the aim is to manage each video project effectively so that we produce great videos that our clients love key attributesclient focused excellent customer service and communication skillsonline oustanding computer knowledge and experience using online software and project management toolsorganised manage workload and able to multi task attention to detailmotivated self starter with a passion for doing excellent work and achieving great resultsadaptable show initiative and think on your feet as this is a constantly evolving atmosphereflexible fast turnaround 
work and after hours availabilityeasy going amp upbeat dosen t get bogged down and loves the challengesense of humour have a laugh and know that working in a startup takes guts ability to deliver including meeting project deadlines and budgetattitude is more important than experience at seconds however previous experience in customer service and or project management is beneficialplease view our platform website at url a a ce f a fb e e ad d a f d ebbd c dd dea and get a clear understand about what we do before reaching out what you will get from usthrough being part of the seconds team you will gain experience working on projects located around the world with an international brandexperience working with a variety of clients and on a large range of projectsopportunity to drive and grow production function and teama positive working environment with a great teampay applying for this role with a videobeing a video business we understand that one of the quickest ways that we can assess your suitability for this role and one of the quickest ways that you can apply for it is for you to submit a second long video telling us about yourself your experience and why you think you would be perfect for the role it s not about being a filmmaker or making a really creative video a simple video filmed with a smart phone or web cam will be fine please also include where you are based and when you can start you can upload the video onto youtube or vimeo or similar as a draft or live link applications due by pm on wednesday th july once you have a video ready apply for this role via the following link together with a cover letter and your cv after we have watched your video and get an idea of your suitability for the role we will email the shortlisted candidates full time not applicable marketing and advertising customer service,0,"[customer, service, cloud, video, production, nz, auckland, success, seconds, the, worlds, cloud, video, production, service, seconds, is, the, worlds, 
cloud, video, production, service, enabling, brands, and, agencies, to, get, high, quality, online, video, content, shot, and, produced, anywhere, in, the, world, seconds, makes, video, production, fast, affordable, and, all, managed, seamlessly, in, the, cloud, from, purchase, to, publish, http, url, fbe, afac, a, cd, c, f, b, d, eef, a, e, d, a, e, f, ca, d, dd, seconds, removes, the, hassle, cost, risk, and, speed, issues, of, working, with, regular, video, production, companies, by, managing, every, aspect, of, video, ...]","[customer, service, cloud, video, production, nz, auckland, success, seconds, worlds, cloud, video, production, service, seconds, worlds, cloud, video, production, service, enabling, brands, agencies, get, high, quality, online, video, content, shot, produced, anywhere, world, seconds, makes, video, production, fast, affordable, managed, seamlessly, cloud, purchase, publish, http, url, fbe, afac, cd, c, f, b, d, eef, e, d, e, f, ca, d, dd, seconds, removes, hassle, cost, risk, speed, issues, working, regular, video, production, companies, managing, every, aspect, video, projects, beautiful, online, experience, growing, global, network, rated, video, professionals, countries, managed, dedicated, production, success, teams, countries, seconds, provides, success, guarantee, seconds, produced, ...]",0.0


In [23]:
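# Learn a 5-dimensional embedding per word; each posting becomes the average of its word vectors
word2Vec = Word2Vec(vectorSize=5, minCount=0, inputCol='filtered', outputCol='features')
model = word2Vec.fit(feature_data)
W2VfeaturizedData = model.transform(feature_data)

# Rescale every dimension to [0, 1]; NaiveBayes requires non-negative feature values
scaler = MinMaxScaler(inputCol='features', outputCol='scaledFeatures')
scalerModel = scaler.fit(W2VfeaturizedData)
scaled_data = scalerModel.transform(W2VfeaturizedData)

# Keep only the columns needed for modelling and expose the scaled vectors as 'features'
W2VfeaturizedData = scaled_data.select('fraudulent','text','label','scaledFeatures')
W2VfeaturizedData = W2VfeaturizedData.withColumnRenamed('scaledFeatures','features')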

W2VfeaturizedData.name = 'W2VfeaturizedData'
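
As a rough illustration of what the cell above computes: Spark's `Word2VecModel` represents each document as the average of its word vectors, and `MinMaxScaler` then rescales every vector dimension to the range [0, 1]. The plain-Python sketch below reproduces both steps on a made-up two-dimensional vocabulary. The word vectors and the helper names `doc_vector` and `min_max_scale` are assumptions for illustration only, and the sketch assumes every token is in the vocabulary (as with `minCount=0`).

```python
from typing import Dict, List

# Hypothetical 2-dimensional word vectors (the notebook uses vectorSize=5).
word_vectors: Dict[str, List[float]] = {
    "video": [1.0, -2.0],
    "cloud": [3.0, 0.0],
    "service": [-1.0, 4.0],
}


def doc_vector(tokens: List[str]) -> List[float]:
    """Average the word vectors of a tokenized document."""
    dim = len(next(iter(word_vectors.values())))
    if not tokens:
        return [0.0] * dim
    vectors = [word_vectors[t] for t in tokens]
    return [sum(v[i] for v in vectors) / len(tokens) for i in range(dim)]


def min_max_scale(rows: List[List[float]]) -> List[List[float]]:
    """Rescale each column to [0, 1] using that column's min and max."""
    dim = len(rows[0])
    lows = [min(r[i] for r in rows) for i in range(dim)]
    highs = [max(r[i] for r in rows) for i in range(dim)]
    scaled = []
    for r in rows:
        scaled.append([
            # A constant column is mapped to the midpoint of the range.
            0.5 if highs[i] == lows[i] else (r[i] - lows[i]) / (highs[i] - lows[i])
            for i in range(dim)
        ])
    return scaled


docs = [["video", "cloud"], ["service"], ["cloud", "service"]]
raw = [doc_vector(d) for d in docs]
scaled = min_max_scale(raw)
for tokens, vec in zip(docs, scaled):
    print(tokens, vec)
```

Because every scaled value lies between 0 and 1, the same feature column can be fed to all of the classifiers tried later, including NaiveBayes.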

In [24]:
def ClassTrainEval(classifier, features, classes, train, test):

    def FindMtype(classifier):
        # The classifier is passed in already instantiated
        M = classifier
        # Look up its class name, e.g. 'LogisticRegression'
        Mtype = type(M).__name__
        
        return Mtype
    
    Mtype = FindMtype(classifier)
    

    def IntanceFitModel(Mtype, classifier, classes, features, train):
        
        if Mtype == 'OneVsRest':
            # instantiate the base classifier.
            lr = LogisticRegression()
            # instantiate the One Vs Rest Classifier.
            OVRclassifier = OneVsRest(classifier=lr)
#             fitModel = OVRclassifier.fit(train)
            # Add parameters of your choice here:
            paramGrid = ParamGridBuilder() \
                .addGrid(lr.regParam, [0.1, 0.01]) \
                .build()
            #Cross Validator requires the following parameters:
            crossval = CrossValidator(estimator=OVRclassifier,
                                      estimatorParamMaps=paramGrid,
                                      evaluator=BinaryClassificationEvaluator(),
                                      numFolds=5) # 3 or more folds is common practice
            # Run cross-validation, and choose the best set of parameters.
            fitModel = crossval.fit(train)
            return fitModel
        if Mtype == 'MultilayerPerceptronClassifier':
            # specify layers for the neural network:
            # input layer of size features, two intermediate of features+1 and same size as features
            # and output of size number of classes
            # Note: crossvalidator cannot be used here
            features_count = len(features[0][0])
            layers = [features_count, features_count+1, features_count, classes]
            MPC_classifier = MultilayerPerceptronClassifier(maxIter=100, layers=layers, blockSize=128, seed=1234)
            fitModel = MPC_classifier.fit(train)
            return fitModel
        if Mtype in('LinearSVC','GBTClassifier') and classes != 2: # These classifiers currently only accept binary classification
            print(Mtype,' could not be used because PySpark currently only accepts binary classification data for this algorithm')
            return
        if Mtype in('LogisticRegression','NaiveBayes','RandomForestClassifier','GBTClassifier','LinearSVC','DecisionTreeClassifier'):
  
            # Add parameters of your choice here:
            if Mtype == 'LogisticRegression':
                paramGrid = (ParamGridBuilder()
#                              .addGrid(classifier.regParam, [0.1, 0.01])
                             .addGrid(classifier.maxIter, [10, 15, 20])
                             .build())
                
            # Add parameters of your choice here:
            if Mtype == 'NaiveBayes':
                paramGrid = (ParamGridBuilder()
                             .addGrid(classifier.smoothing, [0.0, 0.2, 0.4, 0.6])
                             .build())
                
            # Add parameters of your choice here:
            if Mtype == 'RandomForestClassifier':
                paramGrid = (ParamGridBuilder()
                             .addGrid(classifier.maxDepth, [2, 5, 10])
#                              .addGrid(classifier.maxBins, [5, 10, 20])
#                              .addGrid(classifier.numTrees, [5, 20, 50])
                             .build())
                
            # Add parameters of your choice here:
            if Mtype == 'GBTClassifier':
                paramGrid = (ParamGridBuilder()
#                              .addGrid(classifier.maxDepth, [2, 5, 10, 20, 30])
#                              .addGrid(classifier.maxBins, [10, 20, 40, 80, 100])
                             .addGrid(classifier.maxIter, [10, 15, 50, 100])
                             .build())
                
            # Add parameters of your choice here:
            if Mtype == 'LinearSVC':
                paramGrid = (ParamGridBuilder()
                             .addGrid(classifier.maxIter, [10, 15])
                             .addGrid(classifier.regParam, [0.1, 0.01])
                             .build())
            
            # Add parameters of your choice here:
            if Mtype == 'DecisionTreeClassifier':
                paramGrid = (ParamGridBuilder()
#                              .addGrid(classifier.maxDepth, [2, 5, 10, 20, 30])
                             .addGrid(classifier.maxBins, [10, 20, 40, 80, 100])
                             .build())
            
            #Cross Validator requires all of the following parameters:
            crossval = CrossValidator(estimator=classifier,
                                      estimatorParamMaps=paramGrid,
                                      evaluator=BinaryClassificationEvaluator(),
                                      numFolds=5) # 3 + is best practice
            # Fit Model: Run cross-validation, and choose the best set of parameters.
            fitModel = crossval.fit(train)
            return fitModel
    
    fitModel = IntanceFitModel(Mtype, classifier, classes, features, train)
    
    # Print feature selection metrics
    if fitModel is not None:
        
        if Mtype == 'OneVsRest':
            # Get Best Model
            BestModel = fitModel.bestModel
            print(' ')
            print('\033[1m' + Mtype + '\033[0m')
            # Extract list of binary models
            models = BestModel.models
            for model in models:
                print('\033[1m' + 'Intercept: '+ '\033[0m', model.intercept,'\033[1m' + '\nCoefficients:'+ '\033[0m', model.coefficients)

        if Mtype == 'MultilayerPerceptronClassifier':
            print('')
            print('\033[1m' + Mtype,' Weights'+ '\033[0m')
            print('\033[1m' + 'Model Weights: '+ '\033[0m', fitModel.weights.size)
            print('')

        if Mtype in('DecisionTreeClassifier', 'GBTClassifier','RandomForestClassifier'):
            # FEATURE IMPORTANCES
            # Estimate of the importance of each feature.
            # Each feature’s importance is the average of its importance across all trees
            # in the ensemble. The importance vector is normalized to sum to 1.
            # Get Best Model
            BestModel = fitModel.bestModel
            print(' ')
            print('\033[1m' + Mtype,' Feature Importances'+ '\033[0m')
            print('(Scores add up to 1)')
            print('Lowest score is the least important')
            print(' ')
            print(BestModel.featureImportances)
            
            # Keep the best model and its importances in globals for later cells
            if Mtype == 'DecisionTreeClassifier':
                global DT_featureimportances
                DT_featureimportances = BestModel.featureImportances.toArray()
                global DT_BestModel
                DT_BestModel = BestModel
            if Mtype == 'GBTClassifier':
                global GBT_featureimportances
                GBT_featureimportances = BestModel.featureImportances.toArray()
                global GBT_BestModel
                GBT_BestModel = BestModel
            if Mtype == 'RandomForestClassifier':
                global RF_featureimportances
                RF_featureimportances = BestModel.featureImportances.toArray()
                global RF_BestModel
                RF_BestModel = BestModel

        if Mtype == 'LogisticRegression':
            # Get Best Model
            BestModel = fitModel.bestModel
            print(' ')
            print('\033[1m' + Mtype,' Coefficient Matrix'+ '\033[0m')
            print('You should compare these relative to each other')
            print('Coefficients: \n' + str(BestModel.coefficientMatrix))
            print('Intercept: ' + str(BestModel.interceptVector))
            global LR_coefficients
            LR_coefficients = BestModel.coefficientMatrix.toArray()
            global LR_BestModel
            LR_BestModel = BestModel

        if Mtype == 'LinearSVC':
            # Get Best Model
            BestModel = fitModel.bestModel
            print(' ')
            print('\033[1m' + Mtype,' Coefficients'+ '\033[0m')
            print('You should compare these relative to each other')
            print('Coefficients: \n' + str(BestModel.coefficients))
            global LSVC_coefficients
            LSVC_coefficients = BestModel.coefficients.toArray()
            global LSVC_BestModel
            LSVC_BestModel = BestModel
        
   
    # Set the column names to match the external results dataframe that we will join with later:
    columns = ['Classifier', 'Result']
    
    if Mtype in('LinearSVC','GBTClassifier') and classes != 2:
        Mtype = [Mtype] # make this a list
        score = ['N/A']
        result = spark.createDataFrame(zip(Mtype, score), schema=columns)
    else:
        predictions = fitModel.transform(test)
        # Score with area under the ROC curve (scaled to 0-100), not accuracy
        MC_evaluator = BinaryClassificationEvaluator(metricName='areaUnderROC')
        auc = MC_evaluator.evaluate(predictions) * 100
        Mtype = [Mtype] # make this a list
        score = [str(auc)] # convert to a string and wrap it in a list
        result = spark.createDataFrame(zip(Mtype, score), schema=columns)
        result = result.withColumn('Result', result.Result.substr(0, 5))
        
    return result
    # Side effect: best models and their coefficients / feature importances
    # are also stored in globals (e.g. RF_BestModel, LR_coefficients)
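
The `Result` column produced by `ClassTrainEval` is the area under the ROC curve (times 100), not accuracy. One way to read that number: it is the probability that a randomly chosen fraudulent posting gets a higher score than a randomly chosen real one. The sketch below computes it by direct pairwise comparison in plain Python. The labels and scores are made-up toy values, and `auc_by_pairs` is a hypothetical helper; Spark builds the curve differently, but the interpretation is the same.

```python
from typing import Sequence


def auc_by_pairs(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data: 1 = fraudulent, 0 = real; scores are hypothetical model outputs.
labels = [0, 0, 0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.9]
auc = auc_by_pairs(labels, scores)
print(f"AUC = {auc:.3f}")
```

With only about 800 fake postings out of 18K, a model that labels everything as real would already reach high accuracy, which is why a ranking metric such as AUC is the more informative score here.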

In [25]:
classifiers = [
    LogisticRegression(),
    OneVsRest(),
    LinearSVC(),
    NaiveBayes(),
    RandomForestClassifier(),
    GBTClassifier(),
    DecisionTreeClassifier(),
    MultilayerPerceptronClassifier(),
]

In [None]:
train = W2VfeaturizedData.sampleBy('label', fractions={0.0: 0.7, 1.0: 0.7}, seed=16647) 
test = W2VfeaturizedData.subtract(train)

# One row is enough to read the feature vector size; avoid collecting the whole dataset
features = W2VfeaturizedData.select('features').take(1)
class_count = W2VfeaturizedData.select(countDistinct('label')).collect()
classes = class_count[0][0]

columns = ['Classifier', 'Result']
vals = [('Place Holder','N/A')]
results = spark.createDataFrame(vals, columns)

for classifier in classifiers:
    new_result = ClassTrainEval(classifier, features, classes, train, test)
    results = results.union(new_result)

results = results.where("Classifier!='Place Holder'")
results.show(truncate=False)  # show() prints the table itself and returns None

 
LogisticRegression  Coefficient Matrix
You should compare these relative to each other
Coefficients: 
DenseMatrix([[-0.88329853, -8.12662828, -4.42253355,  3.47656663,  3.07098487]])

Intercept: [1.3010120391442799]
 
OneVsRest
Intercept:  1.279021407332071 
Coefficients: [-2.3430055667243765,5.110206005470308,1.075032130410985,-1.5474437444272449,-0.9419807272019671]
Intercept:  -1.2790214073320674 
Coefficients: [2.3430055667243765,-5.110206005470308,-1.0750321304109898,1.5474437444272442,0.9419807272019641]
 
LinearSVC  Coefficients
You should compare these relative to each other
Coefficients: 
[-0.02248529327797267,-0.014324214244473138,-0.023552191780259486,0.013239036096754636,0.012966542816556303]
 
RandomForestClassifier  Feature Importances
(Scores add up to 1)
Lowest score is the least important
 
(5,[0,1,2,3,4],[0.22163170826389714,0.19145855614888813,0.1830363620487308,0.1619178896715437,0.24195548386694035])
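
The split used in the cell above relies on `sampleBy`, which keeps each row with the probability given for its label (0.7 for both classes here), so the rare fraudulent class is represented in train and test in roughly the same proportion. A minimal plain-Python sketch of that idea follows. The rows, ids, and the helper `stratified_split` are assumptions for illustration. Unlike `subtract`, which is a set difference and also drops duplicate rows, this sketch simply sends every row that was not sampled to the test set.

```python
import random
from typing import Dict, List, Tuple

Row = Tuple[float, int]  # (label, hypothetical posting id)


def stratified_split(rows: List[Row], fractions: Dict[float, float],
                     seed: int) -> Tuple[List[Row], List[Row]]:
    """Keep each row for training with the probability set for its label.

    Labels missing from `fractions` get probability 0, so they all go to test.
    """
    rng = random.Random(seed)
    train: List[Row] = []
    test: List[Row] = []
    for row in rows:
        label = row[0]
        if rng.random() < fractions.get(label, 0.0):
            train.append(row)
        else:
            test.append(row)
    return train, test


# Toy data: 950 real postings (label 0.0) and 50 fake ones (label 1.0).
rows = [(0.0, i) for i in range(950)] + [(1.0, i) for i in range(950, 1000)]
train, test = stratified_split(rows, {0.0: 0.7, 1.0: 0.7}, seed=16647)
print(len(train), "training rows,", len(test), "test rows")
```

The sampling is per row, so the 70/30 proportions are approximate rather than exact; fixing the seed makes the split repeatable.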

In [None]:
predictions = RF_BestModel.transform(test)

In [None]:
print('Predicted real postings (prediction = 0):')
predictions.select('text', 'fraudulent', 'prediction').filter('prediction = 0').limit(3).toPandas()

In [None]:
print('Predicted fraudulent postings (prediction = 1):')
predictions.select('text', 'fraudulent', 'prediction').filter('prediction = 1').limit(3).toPandas()
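
Since the goal is to prioritize postings for manual review, it may also help to look at the flagged rows in aggregate: what share of flagged postings is really fake (precision) and what share of fake postings gets flagged (recall). The sketch below computes both from `(fraudulent, prediction)` pairs in plain Python. The counts are invented toy values, not results from this dataset, and `precision_recall` is a hypothetical helper.

```python
from typing import List, Tuple


def precision_recall(pairs: List[Tuple[int, int]]) -> Tuple[float, float]:
    """Compute precision and recall for the positive class (1 = fraudulent).

    Each pair is (actual label, predicted label).
    """
    tp = sum(1 for actual, pred in pairs if actual == 1 and pred == 1)
    fp = sum(1 for actual, pred in pairs if actual == 0 and pred == 1)
    fn = sum(1 for actual, pred in pairs if actual == 1 and pred == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy counts: 94 real postings (4 wrongly flagged), 6 fake ones (3 caught).
pairs = [(0, 0)] * 90 + [(0, 1)] * 4 + [(1, 1)] * 3 + [(1, 0)] * 3
precision, recall = precision_recall(pairs)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

Low precision means reviewers spend time on legitimate postings, while low recall means fake postings slip through unflagged; which one matters more depends on the review capacity available.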