# Fictional Army - Filtering and Sorting

### Introduction:

This exercise was inspired by this [page](http://chrisalbon.com/python/).

Special thanks to: https://github.com/chrisalbon for sharing the dataset and materials.

### Step 1. Import the necessary libraries

In [1]:
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("army").getOrCreate()
spark

### Step 2. This is the data given as a dictionary

In [2]:
# Raw data for an example dataframe about a fictional army
raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scouts', 'Scouts'],
            'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd','1st', '1st', '2nd', '2nd'],
            'deaths': [523, 52, 25, 616, 43, 234, 523, 62, 62, 73, 37, 35],
            'battles': [5, 42, 2, 2, 4, 7, 8, 3, 4, 7, 8, 9],
            'size': [1045, 957, 1099, 1400, 1592, 1006, 987, 849, 973, 1005, 1099, 1523],
            'veterans': [1, 5, 62, 26, 73, 37, 949, 48, 48, 435, 63, 345],
            'readiness': [1, 2, 3, 3, 2, 1, 2, 3, 2, 1, 2, 3],
            'armored': [1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1],
            'deserters': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
            'origin': ['Arizona', 'California', 'Texas', 'Florida', 'Maine', 'Iowa', 'Alaska', 'Washington', 'Oregon', 'Wyoming', 'Louisana', 'Georgia']}

### Step 3. Create a dataframe and assign it to a variable called army. 

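#### Don't forget to include the column names in the order presented in the dictionary ('regiment', 'company', 'deaths'...) so that the column order is consistent with the solutions. Older pandas versions (before 0.23, or running on Python older than 3.6) sort the dictionary keys alphabetically when the columns are omitted; newer versions keep the dictionary's insertion order.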
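
The order "presented in the dictionary" is simply the key order of `raw_data`. A minimal sketch, assuming Python 3.7+ (where dicts preserve insertion order) and a hypothetical three-column subset named `sample`, of how that order can be captured explicitly and how it differs from the alphabetical one:

```python
# Hypothetical three-column subset of raw_data, for illustration only
sample = {
    "regiment": ["Nighthawks", "Dragoons"],
    "company": ["1st", "2nd"],
    "deaths": [523, 43],
}

# Iterating a dict yields its keys in insertion order (Python 3.7+)
presented_order = list(sample)

# The alphabetical order that the note above warns about
alphabetical_order = sorted(sample)

print(presented_order)     # ['regiment', 'company', 'deaths']
print(alphabetical_order)  # ['company', 'deaths', 'regiment']
```

A list such as `presented_order` can be passed as `columns=` to `pd.DataFrame`, or to `army.select(...)` on the Spark side, to pin the column order regardless of library version.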

In [3]:
import pandas as pd

In [20]:
army = spark.createDataFrame(pd.DataFrame(data=raw_data))
army.show()

+----------+-------+------+-------+----+--------+---------+-------+---------+----------+
|  regiment|company|deaths|battles|size|veterans|readiness|armored|deserters|    origin|
+----------+-------+------+-------+----+--------+---------+-------+---------+----------+
|Nighthawks|    1st|   523|      5|1045|       1|        1|      1|        4|   Arizona|
|Nighthawks|    1st|    52|     42| 957|       5|        2|      0|       24|California|
|Nighthawks|    2nd|    25|      2|1099|      62|        3|      1|       31|     Texas|
|Nighthawks|    2nd|   616|      2|1400|      26|        3|      1|        2|   Florida|
|  Dragoons|    1st|    43|      4|1592|      73|        2|      0|        3|     Maine|
|  Dragoons|    1st|   234|      7|1006|      37|        1|      1|        4|      Iowa|
|  Dragoons|    2nd|   523|      8| 987|     949|        2|      0|       24|    Alaska|
|  Dragoons|    2nd|    62|      3| 849|      48|        3|      1|       31|Washington|
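|    Scouts|    1st|    62|      4| 973|      48|        2|      0|        2|    Oregon|
|    Scouts|    1st|    73|      7|1005|     435|        1|      0|        3|   Wyoming|
|    Scouts|    2nd|    37|      8|1099|      63|        2|      1|        2|  Louisana|
|    Scouts|    2nd|    35|      9|1523|     345|        3|      1|        3|   Georgia|
+----------+-------+------+-------+----+--------+---------+-------+---------+----------+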

### Step 4. Set the 'origin' column as the index of the dataframe

In [19]:
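# Spark DataFrames have no row index, so 'origin' stays a regular column;
# later steps filter on it instead (e.g. army.origin == "Arizona").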
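
In pandas, this step is a one-liner with `set_index`. A minimal sketch for comparison, assuming a hypothetical three-row subset (`sample`) of the data; it is only a reference point and not part of the Spark workflow:

```python
import pandas as pd

# Hypothetical three-row subset of the army data, for illustration only
sample = pd.DataFrame({
    "regiment": ["Nighthawks", "Dragoons", "Scouts"],
    "deaths": [523, 43, 62],
    "origin": ["Arizona", "Maine", "Oregon"],
})

# Promote 'origin' from a regular column to the row index
indexed = sample.set_index("origin")

# Rows can now be looked up by label
print(indexed.loc["Maine", "deaths"])  # 43
```

Once `origin` is the index it is no longer listed among the columns, which is why the pandas version of the later steps can address rows by name.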

### Step 5. Print only the column veterans

In [23]:
army.select("veterans").show(5)

+--------+
|veterans|
+--------+
|       1|
|       5|
|      62|
|      26|
|      73|
+--------+
only showing top 5 rows



### Step 6. Print the columns 'veterans' and 'deaths'

In [24]:
army.select("veterans","deaths").show(5)

+--------+------+
|veterans|deaths|
+--------+------+
|       1|   523|
|       5|    52|
|      62|    25|
|      26|   616|
|      73|    43|
+--------+------+
only showing top 5 rows



### Step 7. Print the name of all the columns.

In [25]:
print(army.columns)

['regiment', 'company', 'deaths', 'battles', 'size', 'veterans', 'readiness', 'armored', 'deserters', 'origin']


### Step 8. Select the 'deaths', 'size' and 'deserters' columns from Maine and Alaska

In [26]:
origins = ["Maine","Alaska"]
cols = ["deaths","size","deserters"]
# Filter on 'origin' first, then keep only the requested columns
army.filter(army.origin.isin(origins)).select(cols).show()

+------+----+---------+
|deaths|size|deserters|
+------+----+---------+
|    43|1592|        3|
|   523| 987|       24|
+------+----+---------+



### Step 9. Select the rows 3 to 7 and the columns 3 to 6

In [29]:
army.select(army.columns[2:6]).collect()[2:7]

[Row(deaths=25, battles=2, size=1099, veterans=62),
 Row(deaths=616, battles=2, size=1400, veterans=26),
 Row(deaths=43, battles=4, size=1592, veterans=73),
 Row(deaths=234, battles=7, size=1006, veterans=37),
 Row(deaths=523, battles=8, size=987, veterans=949)]
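
The cell above reads "rows 3 to 7" and "columns 3 to 6" as 1-based, inclusive ranges and translates them into 0-based, half-open Python slices (`[2:7]` and `[2:6]`). A minimal sketch of that translation, where `one_based_range_to_slice` is a hypothetical helper introduced only for illustration:

```python
def one_based_range_to_slice(first, last):
    """Convert a 1-based inclusive range (first..last) to a 0-based slice."""
    if first < 1 or last < first:
        raise ValueError("expected 1 <= first <= last")
    # Shift the start down by one; the end stays, since slices exclude it
    return slice(first - 1, last)


columns = ['regiment', 'company', 'deaths', 'battles', 'size',
           'veterans', 'readiness', 'armored', 'deserters', 'origin']

# "columns 3 to 6" -> columns[2:6]
cols_3_to_6 = columns[one_based_range_to_slice(3, 6)]
print(cols_3_to_6)  # ['deaths', 'battles', 'size', 'veterans']

# "rows 3 to 7" -> slice(2, 7), the slice applied to collect() above
print(one_based_range_to_slice(3, 7))  # slice(2, 7, None)
```

Note that `collect()` brings every row to the driver before slicing, which is fine for a 12-row dataframe but does not scale to large ones.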

### Step 10. Select every row after the fourth row and all columns

In [30]:
army.collect()[4:]

[Row(regiment='Dragoons', company='1st', deaths=43, battles=4, size=1592, veterans=73, readiness=2, armored=0, deserters=3, origin='Maine'),
 Row(regiment='Dragoons', company='1st', deaths=234, battles=7, size=1006, veterans=37, readiness=1, armored=1, deserters=4, origin='Iowa'),
 Row(regiment='Dragoons', company='2nd', deaths=523, battles=8, size=987, veterans=949, readiness=2, armored=0, deserters=24, origin='Alaska'),
 Row(regiment='Dragoons', company='2nd', deaths=62, battles=3, size=849, veterans=48, readiness=3, armored=1, deserters=31, origin='Washington'),
 Row(regiment='Scouts', company='1st', deaths=62, battles=4, size=973, veterans=48, readiness=2, armored=0, deserters=2, origin='Oregon'),
 Row(regiment='Scouts', company='1st', deaths=73, battles=7, size=1005, veterans=435, readiness=1, armored=0, deserters=3, origin='Wyoming'),
 Row(regiment='Scouts', company='2nd', deaths=37, battles=8, size=1099, veterans=63, readiness=2, armored=1, deserters=2, origin='Louisana'),
 Row(regiment='Scouts', company='2nd', deaths=35, battles=9, size=1523, veterans=345, readiness=3, armored=1, deserters=3, origin='Georgia')]

### Step 11. Select every row up to the 4th row and all columns

In [31]:
army.collect()[:4]

[Row(regiment='Nighthawks', company='1st', deaths=523, battles=5, size=1045, veterans=1, readiness=1, armored=1, deserters=4, origin='Arizona'),
 Row(regiment='Nighthawks', company='1st', deaths=52, battles=42, size=957, veterans=5, readiness=2, armored=0, deserters=24, origin='California'),
 Row(regiment='Nighthawks', company='2nd', deaths=25, battles=2, size=1099, veterans=62, readiness=3, armored=1, deserters=31, origin='Texas'),
 Row(regiment='Nighthawks', company='2nd', deaths=616, battles=2, size=1400, veterans=26, readiness=3, armored=1, deserters=2, origin='Florida')]

### Step 12. Select the 3rd column up to the 7th column

In [33]:
army.select(army.columns[2:7]).show(5)

+------+-------+----+--------+---------+
|deaths|battles|size|veterans|readiness|
+------+-------+----+--------+---------+
|   523|      5|1045|       1|        1|
|    52|     42| 957|       5|        2|
|    25|      2|1099|      62|        3|
|   616|      2|1400|      26|        3|
|    43|      4|1592|      73|        2|
+------+-------+----+--------+---------+
only showing top 5 rows



### Step 13. Select rows where deaths is greater than 50

In [34]:
army.filter("deaths>50").show()

+----------+-------+------+-------+----+--------+---------+-------+---------+----------+
|  regiment|company|deaths|battles|size|veterans|readiness|armored|deserters|    origin|
+----------+-------+------+-------+----+--------+---------+-------+---------+----------+
|Nighthawks|    1st|   523|      5|1045|       1|        1|      1|        4|   Arizona|
|Nighthawks|    1st|    52|     42| 957|       5|        2|      0|       24|California|
|Nighthawks|    2nd|   616|      2|1400|      26|        3|      1|        2|   Florida|
|  Dragoons|    1st|   234|      7|1006|      37|        1|      1|        4|      Iowa|
|  Dragoons|    2nd|   523|      8| 987|     949|        2|      0|       24|    Alaska|
|  Dragoons|    2nd|    62|      3| 849|      48|        3|      1|       31|Washington|
|    Scouts|    1st|    62|      4| 973|      48|        2|      0|        2|    Oregon|
|    Scouts|    1st|    73|      7|1005|     435|        1|      0|        3|   Wyoming|
+----------+-------+------+-------+----+--------+---------+-------+---------+----------+

### Step 14. Select rows where deaths is greater than 500 or less than 50

In [37]:
army.where((army.deaths>500) | (army.deaths<50)).show()

+----------+-------+------+-------+----+--------+---------+-------+---------+--------+
|  regiment|company|deaths|battles|size|veterans|readiness|armored|deserters|  origin|
+----------+-------+------+-------+----+--------+---------+-------+---------+--------+
|Nighthawks|    1st|   523|      5|1045|       1|        1|      1|        4| Arizona|
|Nighthawks|    2nd|    25|      2|1099|      62|        3|      1|       31|   Texas|
|Nighthawks|    2nd|   616|      2|1400|      26|        3|      1|        2| Florida|
|  Dragoons|    1st|    43|      4|1592|      73|        2|      0|        3|   Maine|
|  Dragoons|    2nd|   523|      8| 987|     949|        2|      0|       24|  Alaska|
|    Scouts|    2nd|    37|      8|1099|      63|        2|      1|        2|Louisana|
|    Scouts|    2nd|    35|      9|1523|     345|        3|      1|        3| Georgia|
+----------+-------+------+-------+----+--------+---------+-------+---------+--------+



In [38]:
army.filter((army.deaths>500) | (army.deaths<50)).show()

+----------+-------+------+-------+----+--------+---------+-------+---------+--------+
|  regiment|company|deaths|battles|size|veterans|readiness|armored|deserters|  origin|
+----------+-------+------+-------+----+--------+---------+-------+---------+--------+
|Nighthawks|    1st|   523|      5|1045|       1|        1|      1|        4| Arizona|
|Nighthawks|    2nd|    25|      2|1099|      62|        3|      1|       31|   Texas|
|Nighthawks|    2nd|   616|      2|1400|      26|        3|      1|        2| Florida|
|  Dragoons|    1st|    43|      4|1592|      73|        2|      0|        3|   Maine|
|  Dragoons|    2nd|   523|      8| 987|     949|        2|      0|       24|  Alaska|
|    Scouts|    2nd|    37|      8|1099|      63|        2|      1|        2|Louisana|
|    Scouts|    2nd|    35|      9|1523|     345|        3|      1|        3| Georgia|
+----------+-------+------+-------+----+--------+---------+-------+---------+--------+
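
Each comparison in the two cells above is wrapped in parentheses because Python's `|` operator binds more tightly than `>` and `<`. A minimal plain-Python sketch of how the unparenthesized form is parsed differently, with an integer standing in for a Spark column (an assumption made only for illustration):

```python
deaths = 25  # a value that should match "greater than 500 or less than 50"

# Parenthesized: each comparison is evaluated first, then combined with |
with_parens = (deaths > 500) | (deaths < 50)

# Unparenthesized: parsed as the chained comparison deaths > (500 | deaths) < 50
without_parens = deaths > 500 | deaths < 50

print(with_parens)     # True
print(without_parens)  # False, because 500 | 25 == 509 and 25 > 509 is False
```

With Spark `Column` objects the unparenthesized form does not silently return a wrong answer; it typically fails with an error, since the chained comparison tries to convert a `Column` to a Python boolean.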



#### "filter" and "where" can be used interchangeably: in PySpark, "where" is an alias for "filter"

### Step 15. Select all the regiments not named "Dragoons"

In [39]:
army.where(army.regiment != "Dragoons").show()

+----------+-------+------+-------+----+--------+---------+-------+---------+----------+
|  regiment|company|deaths|battles|size|veterans|readiness|armored|deserters|    origin|
+----------+-------+------+-------+----+--------+---------+-------+---------+----------+
|Nighthawks|    1st|   523|      5|1045|       1|        1|      1|        4|   Arizona|
|Nighthawks|    1st|    52|     42| 957|       5|        2|      0|       24|California|
|Nighthawks|    2nd|    25|      2|1099|      62|        3|      1|       31|     Texas|
|Nighthawks|    2nd|   616|      2|1400|      26|        3|      1|        2|   Florida|
|    Scouts|    1st|    62|      4| 973|      48|        2|      0|        2|    Oregon|
|    Scouts|    1st|    73|      7|1005|     435|        1|      0|        3|   Wyoming|
|    Scouts|    2nd|    37|      8|1099|      63|        2|      1|        2|  Louisana|
|    Scouts|    2nd|    35|      9|1523|     345|        3|      1|        3|   Georgia|
+----------+-------+------+-------+----+--------+---------+-------+---------+----------+

### Step 16. Select the rows called Texas and Arizona

In [40]:
origins = ["Texas", "Arizona"]
army.where(army.origin.isin(origins)).show()

+----------+-------+------+-------+----+--------+---------+-------+---------+-------+
|  regiment|company|deaths|battles|size|veterans|readiness|armored|deserters| origin|
+----------+-------+------+-------+----+--------+---------+-------+---------+-------+
|Nighthawks|    1st|   523|      5|1045|       1|        1|      1|        4|Arizona|
|Nighthawks|    2nd|    25|      2|1099|      62|        3|      1|       31|  Texas|
+----------+-------+------+-------+----+--------+---------+-------+---------+-------+



### Step 17. Select the third cell in the row named Arizona

In [48]:
army.where(army.origin == "Arizona").select(army.columns[2]).head(1)[0][0]

523

### Step 18. Select the third cell down in the column named deaths

In [53]:
army.select("deaths").collect()[2]

Row(deaths=25)