# PyDrill With Docker

Now that we've processed our data in Spark, we can query it in Apache Drill. This notebook walks you through querying that data with PyDrill, against Apache Drill running in a Docker container.

In [1]:
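# Convert spark_titanic.json to line-delimited JSON records (spark_titanic_v3.json), the file queried below.
# import pandas as pd
# titanic = pd.read_json("spark_titanic.json")
# titanic.to_json('spark_titanic_v3.json', orient='records', lines=True)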

In [2]:
from pydrill.client import PyDrill

In [3]:
drill = PyDrill(host='localhost', port=8047)
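
Before running any queries, it can help to confirm that the client can actually reach Drill. The following is a minimal sketch, assuming the Docker container publishes Drill's REST port 8047 on localhost. `require_active` is a hypothetical helper name; `is_active()` is the PyDrill client method it relies on.

```python
def require_active(client):
    """Return the client if Drill responds, otherwise raise an error."""
    # is_active() reports whether the Drill REST endpoint is reachable.
    if not client.is_active():
        raise RuntimeError(
            "Drill is not reachable; check that the Docker container is running"
        )
    return client

# Usage with the client created above:
# drill = require_active(drill)
```

Failing early here gives a clearer error than a failed query further down the notebook.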

## Query Spark Titanic Data With Drill 

Now that the data has been processed in Spark, we can query it in Apache Drill.

First, select all columns from the `spark_titanic_v3.json` file.

Notice that the file name is prefixed with `drill/`, the directory we mounted in Drill. If the mount directory isn't specified, you won't be able to query the file!

In [4]:
drill.query('''SELECT * FROM dfs.`drill/spark_titanic_v3.json` limit 2 ''').to_dataframe()

Unnamed: 0,Age,Cabin,Embarked,Fare,Name,Parch,PassengerId,Pclass,Sex,SibSp,Survived,Ticket,date_time_received,date_time_sent,time_time_received,time_time_sent
0,22.0,__NA__,S,7.25,"Braund, Mr. Owen Harris",0,1,3,male,1,0,A/5 21171,2018-12-08,2018-12-08,08:22:33,08:16:44
1,38.0,C85,C,71.2833,"Cumings, Mrs. John Bradley (Florence Briggs Th...",0,2,1,female,1,1,PC 17599,2018-12-08,2018-12-08,08:22:33,08:16:18
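
The table reference in the query above has three parts: the storage plugin (`dfs`), the mounted directory (`drill`), and the file name, with the path wrapped in backticks. Here is a minimal sketch of building that reference so the mount directory is never left out. `drill_table` is a hypothetical helper, not part of PyDrill.

```python
def drill_table(filename, mount_dir="drill", plugin="dfs"):
    """Build a backtick-quoted Drill table reference such as dfs.`drill/file.json`."""
    return f"{plugin}.`{mount_dir}/{filename}`"

table = drill_table("spark_titanic_v3.json")
query = f"SELECT * FROM {table} LIMIT 2"
print(query)  # SELECT * FROM dfs.`drill/spark_titanic_v3.json` LIMIT 2
```

The resulting string can be passed to `drill.query(...)` in the same way as the query in the cell above.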


Remember how we replaced missing ages with the average age in Spark? Let's query the `Age` column to get the average age.

In [5]:
drill.query('''SELECT AVG(age) as avg_age FROM dfs.`drill/spark_titanic_v3.json`''').to_dataframe()

Unnamed: 0,avg_age
0,29.71360820244327
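
Filling missing values with the mean leaves the mean of the column unchanged, which is why the imputation does not shift the average returned here. A small plain-Python sketch of that property, using the two ages from the first query's output plus one illustrative missing value (the `None` entry is an assumption for demonstration, not a row from the file):

```python
from statistics import mean

# Two ages from the output above, plus one illustrative missing value.
ages = [22.0, 38.0, None]

# Mean of the known values only.
known = [a for a in ages if a is not None]
known_mean = mean(known)

# Replace each missing value with that mean, as was done in Spark.
imputed = [a if a is not None else known_mean for a in ages]

print(known_mean, mean(imputed))  # 30.0 30.0
```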


Ticket prices on the Titanic varied substantially by class. Let's review what each class paid on average per ticket.

We can get the average fare per class by grouping the rows on `Pclass` and applying the `AVG` aggregation to `Fare`.

drill.query('''
    SELECT Pclass, AVG(Fare) as avg_fare
    FROM dfs.`drill/spark_titanic_v3.json`
    GROUP BY Pclass
    ORDER BY Pclass ''').to_dataframe()

`Pclass` 1 definitely paid more than the other classes; they're first class. But how many people were in each class? We can change the `AVG` aggregation to a `COUNT` and get this number.

In [6]:
drill.query('''
    SELECT Pclass, COUNT(*) as total_passengers
    FROM dfs.`drill/spark_titanic_v3.json`
    GROUP BY Pclass
    ORDER BY Pclass ''').to_dataframe()

Unnamed: 0,Pclass,total_passengers
0,1,1105
1,2,942
2,3,2537


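In this case, third class has substantially more passengers than the other two classes, which is consistent with its low ticket prices. Note that `COUNT(*)` counts rows rather than unique passengers, so these totals are larger than the 891 distinct `PassengerId` values counted later in this notebook.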
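
The query above uses `COUNT(*)`, while the survival queries later in this notebook use `COUNT(DISTINCT PassengerId)`. The sketch below shows how the two differ when a file holds repeated records. It runs the SQL in Python's built-in `sqlite3` rather than Drill, and the table name `titanic` and the toy rows are assumptions for illustration only.

```python
import sqlite3

# Toy rows (PassengerId, Pclass); passenger 1 is repeated on purpose to
# mimic a file that holds more than one record per passenger.
rows = [(1, 3), (1, 3), (2, 1)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE titanic (PassengerId INTEGER, Pclass INTEGER)")
conn.executemany("INSERT INTO titanic VALUES (?, ?)", rows)

# Compare the row count with the distinct-passenger count per class.
counts = conn.execute("""
    SELECT Pclass,
           COUNT(*) AS total_rows,
           COUNT(DISTINCT PassengerId) AS total_passengers
    FROM titanic
    GROUP BY Pclass
    ORDER BY Pclass
""").fetchall()
conn.close()

print(counts)  # [(1, 1, 1), (3, 2, 1)]
```

The same `SELECT` list should work against the Drill file if you want both numbers side by side.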

## Querying By Name 

What if we wanted to query by text in the name? SQL lets you do this with the `LIKE` operator, and because PyDrill passes the SQL through to Drill, it works here too.

Run the query to see how many people had the salutation `Mr.` in their name.

In [7]:
drill.query('''
    SELECT COUNT(*) as total_mr
    FROM dfs.`drill/spark_titanic_v3.json`
    WHERE Name LIKE '%Mr.%' ''').to_dataframe()

Unnamed: 0,total_mr
0,2657


Change the query to see how many people had the salutation `Mrs.` in their name.

In [8]:
drill.query('''
    SELECT COUNT(*) as total_mrs
    FROM dfs.`drill/spark_titanic_v3.json`
    WHERE Name LIKE '%Mrs.%' ''').to_dataframe()

Unnamed: 0,total_mrs
0,642
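
In SQL, `%` in a `LIKE` pattern matches any sequence of characters and `.` is a literal period, so `'%Mr.%'` is a substring test for `Mr.`. The following is a minimal Python sketch of that test, using the two names shown in the first query's output (the second is truncated there and kept as-is). It mimics the pattern with Python's `in` operator rather than running Drill, and `count_containing` is a hypothetical helper.

```python
names = [
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Th...",
]

def count_containing(values, needle):
    """Count values that contain needle, like WHERE col LIKE '%needle%'."""
    return sum(needle in value for value in values)

total_mr = count_containing(names, "Mr.")
total_mrs = count_containing(names, "Mrs.")
print(total_mr, total_mrs)  # 1 1
```

Because `Mrs.` does not contain the substring `Mr.`, the two counts do not overlap.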


## Survival By Gender

Finally, we'll see how many people actually survived.

In [9]:
drill.query('''
    SELECT Survived, COUNT(DISTINCT PassengerId) AS total_passengers
    FROM dfs.`drill/spark_titanic_v3.json`
    GROUP BY Survived
    ORDER BY Survived
    ''').to_dataframe()

Unnamed: 0,Survived,total_passengers
0,0,549
1,1,342


Let's drill down further and break the data down by `Pclass` and `Sex`.

In [10]:
drill.query('''
    SELECT Survived, Sex, Pclass, COUNT(DISTINCT PassengerId) AS total_passengers
    FROM dfs.`drill/spark_titanic_v3.json`
    GROUP BY Survived, Sex, Pclass
    ORDER BY Pclass, Survived
    ''').to_dataframe()

Unnamed: 0,Pclass,Sex,Survived,total_passengers
0,1,male,0,77
1,1,female,0,3
2,1,female,1,91
3,1,male,1,45
4,2,male,0,91
5,2,female,0,6
6,2,female,1,70
7,2,male,1,17
8,3,male,0,300
9,3,female,0,72


This aggregation shows a well-known pattern in the Titanic data: not only did the majority of people die, but deaths were far more common among men than among women. The survival rate also declines substantially as you go down in class.
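
To put numbers on these trends, the counts from the outputs above can be turned into survival rates. This is a small sketch in plain Python. It covers only first and second class, because the third-class survivor rows are not shown in the displayed output. `survival_rate` is a hypothetical helper.

```python
# (Pclass, Sex, Survived) -> distinct passengers, copied from the output above.
counts = {
    (1, "male", 0): 77, (1, "female", 0): 3,
    (1, "female", 1): 91, (1, "male", 1): 45,
    (2, "male", 0): 91, (2, "female", 0): 6,
    (2, "female", 1): 70, (2, "male", 1): 17,
}

def survival_rate(counts, pclass):
    """Share of passengers in the given class who survived."""
    survived = sum(n for (c, _, s), n in counts.items() if c == pclass and s == 1)
    total = sum(n for (c, _, _), n in counts.items() if c == pclass)
    return survived / total

rates = {pclass: round(survival_rate(counts, pclass), 3) for pclass in (1, 2)}

# Overall rate from the earlier Survived query: 342 survivors, 549 deaths.
overall = round(342 / (549 + 342), 3)

print(rates, overall)  # {1: 0.63, 2: 0.473} 0.384
```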

## Conclusion

That wraps up using Apache Drill to query the data you processed in Apache Spark!