# Evaluation of Poker hand
  - Project by Akshay Bhatkar

## Objective
  * Predict whether a poker hand is good (recommended to play), fair (may be worth a try), or bad (recommended to fold).

## Attribute Information:

* 1) S1 "Suit of card #1" 
  * Ordinal (1-4) representing {Hearts, Spades, Diamonds, Clubs} 

* 2) C1 "Rank of card #1" 
  * Numerical (1-13) representing (Ace, 2, 3, ... , Queen, King) 

* 3) S2 "Suit of card #2" 
  * Ordinal (1-4) representing {Hearts, Spades, Diamonds, Clubs} 

* 4) C2 "Rank of card #2" 
  * Numerical (1-13) representing (Ace, 2, 3, ... , Queen, King) 

* 5) S3 "Suit of card #3" 
  * Ordinal (1-4) representing {Hearts, Spades, Diamonds, Clubs} 

* 6) C3 "Rank of card #3" 
  * Numerical (1-13) representing (Ace, 2, 3, ... , Queen, King) 

* 7) S4 "Suit of card #4" 
  * Ordinal (1-4) representing {Hearts, Spades, Diamonds, Clubs} 

* 8) C4 "Rank of card #4" 
  * Numerical (1-13) representing (Ace, 2, 3, ... , Queen, King) 

* 9) S5 "Suit of card #5" 
  * Ordinal (1-4) representing {Hearts, Spades, Diamonds, Clubs} 

* 10) C5 "Rank of card #5" 
  * Numerical (1-13) representing (Ace, 2, 3, ... , Queen, King) 

* 11) CLASS "Poker Hand" 
  * Ordinal (0-9) 

* Class : Poker Hand
  * 0: Nothing in hand; not a recognized poker hand 
  * 1: One pair; one pair of equal ranks within five cards 
  * 2: Two pairs; two pairs of equal ranks within five cards 
  * 3: Three of a kind; three equal ranks within five cards 
  * 4: Straight; five cards, sequentially ranked with no gaps 
  * 5: Flush; five cards with the same suit 
  * 6: Full house; pair + different rank three of a kind 
  * 7: Four of a kind; four equal ranks within five cards 
  * 8: Straight flush; straight + flush 
  * 9: Royal flush; {Ace, King, Queen, Jack, Ten} + flush
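
The class definitions above can be sketched as a small pure-Python function. This is an illustrative sketch, not part of the original notebook: the name `classify_hand` is hypothetical, and it assumes that an Ace-high sequence (Ten, Jack, Queen, King, Ace) also counts as a straight, which the list above does not state explicitly.

```python
from collections import Counter

def classify_hand(cards):
    """Return the class (0-9) for a hand given as five (suit, rank) tuples.

    Suits are 1-4 and ranks are 1-13 with Ace = 1, as in the attribute list.
    """
    suits = [s for s, _ in cards]
    ranks = sorted(r for _, r in cards)
    # Multiplicities of each rank, largest first, e.g. [3, 2] for a full house.
    counts = sorted(Counter(ranks).values(), reverse=True)

    is_flush = len(set(suits)) == 1
    is_royal = ranks == [1, 10, 11, 12, 13]
    # Five distinct ranks spanning exactly four steps form a straight.
    # Assumption: the Ace-high sequence also counts as a straight.
    is_straight = is_royal or (len(set(ranks)) == 5 and ranks[4] - ranks[0] == 4)

    if is_flush and is_royal:
        return 9
    if is_flush and is_straight:
        return 8
    if counts[0] == 4:
        return 7
    if counts == [3, 2]:
        return 6
    if is_flush:
        return 5
    if is_straight:
        return 4
    if counts[0] == 3:
        return 3
    if counts == [2, 2, 1]:
        return 2
    if counts[0] == 2:
        return 1
    return 0
```

For example, `classify_hand([(1, 1), (1, 10), (1, 11), (1, 12), (1, 13)])` returns 9, since all five cards share a suit and form the royal sequence.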

## Download the data

In [5]:
%sh
mkdir -p kaggle_project
curl 'http://archive.ics.uci.edu/ml/machine-learning-databases/poker/poker-hand-testing.data' > kaggle_project/poker_hand.data
ls kaggle_project

In [6]:
# Read the poker_hand.data file as CSV.
# The file has no header row, so we set header to 'false' and let Spark assign default column names (_c0, _c1, ...).
df_data = spark.read\
  .format('org.apache.spark.sql.execution.datasources.csv.CSVFileFormat')\
  .option('header', 'false')\
  .option('inferSchema', 'true')\
  .load("file:/databricks/driver/kaggle_project/poker_hand.data")
df_data.show(3)

In [7]:
df_data.printSchema()

## Rename the columns

In [9]:
df = df_data.selectExpr('_c0 as card1suit','_c1 as card1rank', '_c2 as card2suit', '_c3 as card2rank', '_c4 as card3suit', '_c5 as card3rank', '_c6 as card4suit', '_c7 as card4rank', '_c8 as card5suit','_c9 as card5rank', '_c10 as class_poker_hand')

## Check for the Presence of Dirty Data:
  * Check if suit value is less than 1 or greater than 4
  * Check if rank value is less than 1 or greater than 13
  * Check if class of poker hand has value less than 0 or greater than 9
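
The three range checks can also be expressed for a single row in plain Python. This is a minimal sketch outside Spark, assuming the row holds its 11 values in the order S1, C1, ..., S5, C5, CLASS; the helper name `is_dirty` is hypothetical.

```python
# Valid ranges taken from the attribute list; the helper name is hypothetical.
SUIT_MIN, SUIT_MAX = 1, 4
RANK_MIN, RANK_MAX = 1, 13
CLASS_MIN, CLASS_MAX = 0, 9

def is_dirty(row):
    """Return True if any value in an 11-value row is out of range.

    The row order is assumed to be S1, C1, S2, C2, ..., S5, C5, CLASS.
    """
    suits = row[0:10:2]   # positions 0, 2, 4, 6, 8
    ranks = row[1:10:2]   # positions 1, 3, 5, 7, 9
    hand_class = row[10]
    bad_suit = any(s < SUIT_MIN or s > SUIT_MAX for s in suits)
    bad_rank = any(r < RANK_MIN or r > RANK_MAX for r in ranks)
    bad_class = hand_class < CLASS_MIN or hand_class > CLASS_MAX
    return bad_suit or bad_rank or bad_class
```

A row passes only if all ten card values and the class fall inside their ranges, which mirrors the combined `|` filters in the cells below.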

In [11]:
df_dirtyRank = df[(df.card1rank < 1) | (df.card1rank > 13) | (df.card2rank < 1) | (df.card2rank > 13) | (df.card3rank < 1) | (df.card3rank > 13) | (df.card4rank < 1) | (df.card4rank > 13) | (df.card5rank < 1) | (df.card5rank > 13)]

df_dirtyRank.collect()

In [12]:
df_dirtySuit = df[(df.card1suit > 4) | (df.card1suit < 1) | (df.card2suit < 1)| (df.card2suit > 4)| (df.card3suit < 1)| (df.card3suit > 4)| (df.card4suit < 1)| (df.card4suit > 4) | (df.card5suit < 1)| (df.card5suit > 4)]
df_dirtySuit.collect()

In [13]:
df_dirtyClass = df[(df.class_poker_hand < 0) | (df.class_poker_hand > 9)]
df_dirtyClass.collect()

## Data Cleaning
* The checks above return no rows, so the data contains no out-of-range values.
* As a precaution, we drop any rows containing null values with the command below.

In [15]:
df_cleaned = df.dropna()

## Data Exploration

In [17]:
df_cleaned.show(3)

In [18]:
df_cleaned.printSchema()

In [19]:
df_cleaned.select('class_poker_hand').distinct().orderBy('class_poker_hand').show()

In [20]:
# Count the rows for each poker hand class, sorted by class
df_count = df_cleaned.groupBy('class_poker_hand').count().orderBy('class_poker_hand')

display(df_count)

## Data transformation

  * We add a new column "grade" to the DataFrame. It can have the following values:
    * Bad -> Not a recognized poker hand (class 0) -> recommended to fold
    * Fair -> One pair (class 1) -> may be worth playing
    * Good -> Any hand better than one pair (classes 2-9) -> recommended to play, since the chances of winning are high
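
The mapping from class to grade can be sketched as a plain Python function. This is only an illustration of the rule above; the notebook itself applies the mapping with a lookup DataFrame and a join, and the name `grade_for_class` is hypothetical.

```python
def grade_for_class(hand_class):
    """Map a poker hand class (0-9) to its grade."""
    if hand_class == 0:
        return 'Bad'    # not a recognized poker hand
    if hand_class == 1:
        return 'Fair'   # one pair
    if 2 <= hand_class <= 9:
        return 'Good'   # anything better than one pair
    # Values outside 0-9 are not valid classes in this data set.
    raise ValueError("class must be between 0 and 9, got %r" % (hand_class,))
```

Raising on out-of-range values is a design choice of this sketch; a left outer join against the lookup table would instead leave the grade null for such rows.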

In [22]:
lookup = sqlContext.createDataFrame([(0,'Bad'),(1,'Fair'),(2,'Good'),(3,'Good'),(4,'Good'),(5,'Good'),(6,'Good'),(7,'Good'), (8,'Good'), (9, 'Good')], ('k','v'))

df_2 = (df_cleaned.join(lookup, df_cleaned["class_poker_hand"] == lookup["k"], "leftouter")
    .drop("k")
    .withColumnRenamed("v", "grade"))
df_2.show(3)
# df_2.show()

In [23]:
display(df_2.select('grade').groupBy('grade').count())

## Data Modeling

In [25]:
#Split the data into 80% train data and 20% test data
splitted_data = df_2.randomSplit([0.8, 0.2], 24)
train_data = splitted_data[0]
test_data = splitted_data[1]

print("Number of training records: " + str(train_data.count()))
print("Number of testing records : " + str(test_data.count()))

In [26]:
from pyspark.ml.feature import OneHotEncoder, StringIndexer, IndexToString, VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml import Pipeline, Model

In [27]:
stringIndexer_label = StringIndexer(inputCol="grade", outputCol="label").fit(df_2)
# Note: outputCol above is "label", the column the model predicts. Here we predict the grade of the poker hand from the attributes below.
stringIndexer_c1s = StringIndexer(inputCol="card1suit", outputCol="card1suit_IX")
stringIndexer_c2s = StringIndexer(inputCol="card2suit", outputCol="card2suit_IX")
stringIndexer_c3s = StringIndexer(inputCol="card3suit", outputCol="card3suit_IX")
stringIndexer_c4s = StringIndexer(inputCol="card4suit", outputCol="card4suit_IX")
stringIndexer_c5s = StringIndexer(inputCol="card5suit", outputCol="card5suit_IX")

stringIndexer_c1r = StringIndexer(inputCol="card1rank", outputCol="card1rank_IX")
stringIndexer_c2r = StringIndexer(inputCol="card2rank", outputCol="card2rank_IX")
stringIndexer_c3r = StringIndexer(inputCol="card3rank", outputCol="card3rank_IX")
stringIndexer_c4r = StringIndexer(inputCol="card4rank", outputCol="card4rank_IX")
stringIndexer_c5r = StringIndexer(inputCol="card5rank", outputCol="card5rank_IX")


In [28]:
# Select the input columns for the model (and put them into one features column)
vectorAssembler_features = VectorAssembler(inputCols=["card1suit_IX", "card2suit_IX", "card3suit_IX", "card4suit_IX", "card5suit_IX", "card1rank_IX", "card2rank_IX", "card3rank_IX", "card4rank_IX", "card5rank_IX"], outputCol="features")

In [29]:
# The model - RandomForest
rf = RandomForestClassifier(labelCol="label", featuresCol="features")

In [30]:
# Columns for the output
# Convert from indexed labels (added above) back to original labels.
# https://spark.apache.org/docs/2.0.2/api/python/pyspark.ml.html#pyspark.ml.feature.IndexToString
labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel", labels=stringIndexer_label.labels)

In [31]:
# The ML pipeline
pipeline_rf = Pipeline(stages=[stringIndexer_label, stringIndexer_c1s, stringIndexer_c2s, stringIndexer_c3s, stringIndexer_c4s, stringIndexer_c5s, stringIndexer_c1r, stringIndexer_c2r, stringIndexer_c3r, stringIndexer_c4r, stringIndexer_c5r, vectorAssembler_features, rf, labelConverter])

In [32]:
# Model training
model_rf = pipeline_rf.fit(train_data)

## Model Evaluation

In [34]:
# Model quality
predictions = model_rf.transform(test_data)
evaluatorRF = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
accuracy = evaluatorRF.evaluate(predictions)
print("Accuracy = %g" % accuracy)
print("Test Error = %g" % (1.0 - accuracy))

In [35]:
display(predictions)

In [36]:
# List the distinct predicted label indices (the output showed only 2 of the 3 grades)
predictions.select('prediction').distinct().show()

In [37]:
correct = predictions.where("(label = prediction)").count()
incorrect = predictions.where("(label != prediction)").count()

resultDF = sqlContext.createDataFrame([['correct', correct], ['incorrect', incorrect]], ['metric', 'value'])
display(resultDF)
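
Since a single accuracy figure can hide a grade that is never predicted, a per-pair breakdown may be more informative than the correct/incorrect counts alone. The sketch below works on plain (label, prediction) pairs rather than on the Spark DataFrame; the function name `summarize` and the toy pairs are made up for illustration and are not results from this notebook.

```python
from collections import Counter

def summarize(pairs):
    """Return (accuracy, confusion) for an iterable of (label, prediction) pairs.

    confusion maps each (label, prediction) pair to the number of times it occurs.
    """
    confusion = Counter(pairs)
    total = sum(confusion.values())
    correct = sum(n for (label, pred), n in confusion.items() if label == pred)
    accuracy = correct / float(total) if total else 0.0
    return accuracy, confusion

# Toy data only: 'Good' is never predicted here, which the counts make visible.
toy_pairs = [('Bad', 'Bad'), ('Bad', 'Bad'), ('Fair', 'Bad'),
             ('Fair', 'Fair'), ('Good', 'Fair')]
accuracy, confusion = summarize(toy_pairs)
predicted_grades = set(pred for _, pred in confusion)
```

With the toy pairs, three of five predictions match, and `predicted_grades` contains only 'Bad' and 'Fair'.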