## Introduction to Pandas - Practice Challenge

In [None]:
import pandas as pd

from unittest import TestCase, TestLoader, TextTestRunner

def runTest(case):
    # Build a suite from the test case class itself, then run it.
    suite = TestLoader().loadTestsFromTestCase(case)
    TextTestRunner().run(suite)

## This Week's Aims

The Titanic survivor dataset turns up frequently in introductions to machine learning.
Have a look at the [data](https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv). It's a big .csv (comma-separated values) file. Python has a [module](https://docs.python.org/3/library/csv.html) in the standard library for dealing with them, but that's too much like hard work. We don't want to write loads of code that repeatedly loops over the data to gain the insights we want. There is an easier way. The [Pandas](https://pandas.pydata.org/) library provides a "complex type" called a dataframe, a bit like the ones in the R language.

We'll try some basic techniques to get some decent insights from the data, *without writing much code*. If you write more than a half-dozen lines in most of these challenges, you've probably gone wrong! Hopefully, you should be able to point Pandas at most structured data that grabs your interest, and draw conclusions from it with minimal effort.

In [None]:
titanic_url = 'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv'

All we need to do to put the CSV in a dataframe is pass the URL to the Pandas ["read_csv" function](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html):

In [None]:
titanic = pd.read_csv(titanic_url)

Jupyter can display dataframes quite nicely:

In [None]:
titanic

Dataframes have a useful ["describe" method](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.describe.html). We can see here that the average age of a passenger was about 30. Since the "Survived" column is either 0 or 1, its mean tells us that 38% of passengers survived. Not *all* the counts equal 891: there's a lot of missing data, but Pandas mostly copes.

In [None]:
titanic.describe()
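
To see why the counts differ and why the mean of a 0/1 column reads as a fraction, here is a minimal sketch on a made-up four-row dataframe (the values in `toy` are invented for illustration and are not taken from the Titanic data):

```python
import pandas as pd

# Hypothetical toy data: one missing age, and a 0/1 "Survived" flag.
toy = pd.DataFrame({
    'Age': [22.0, 38.0, None, 35.0],
    'Survived': [0, 1, 1, 0],
})

summary = toy.describe()

# "count" ignores missing values, so Age reports 3 entries rather than 4.
print(summary.loc['count'])

# The mean of a 0/1 column is the fraction of ones (here, one half).
print(summary.loc['mean', 'Survived'])
```

The same reading applies to the real table: a count below 891 signals missing values in that column.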

### Part 1: Selecting and Slicing
You can select columns from a dataframe a bit like a dictionary, and slice the rows a bit like a list. Here are the first 5 names:

In [None]:
titanic['Name'][0:5]

To select multiple columns, you use a list:

In [None]:
titanic[['Name','Pclass']][0:5]
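
One detail worth noticing is that single brackets give back a series, while a list of column names gives back a dataframe. A small sketch, using an invented `toy` dataframe rather than the real data:

```python
import pandas as pd

# Hypothetical toy data, just to show the types involved.
toy = pd.DataFrame({
    'Name': ['Ann', 'Bob', 'Cy', 'Di'],
    'Pclass': [1, 3, 2, 3],
    'Fare': [80.0, 7.25, 13.0, 8.05],
})

one_column = toy['Name']               # single brackets: a Series
two_columns = toy[['Name', 'Pclass']]  # a list of names: a DataFrame
first_two_rows = two_columns[0:2]      # slicing selects rows by position

print(type(one_column).__name__)
print(type(two_columns).__name__)
print(first_two_rows)
```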

Now it's your turn. Complete the function "ageFare" so that it returns a dataframe with the first *n* ages and fares of the dataframe *df*. The cell will test the function to see if you've got it right.

In [None]:
class TestAgeFare(TestCase):
    """Don't worry too much about how these tests work. In this case:
    Your function should return a Pandas dataframe.
    It should *only* have the columns "Age" and "Fare".
    The ages and fares should have the correct values.
    """
    
    def setUp(self):
        try:
            self.result = ageFare(titanic,15)
        except:
            self.result = None
        try:
            self.result_cols = list(self.result.columns)
        except:
            self.result_cols = []
        try:
            self.result_sum = self.result.sum()
        except:
            self.result_sum = dict()
            
    def test_dataframe_returned(self):
        self.assertIsInstance(self.result, pd.DataFrame)
        
    def test_correct_columns(self):
        self.assertEqual(list(self.result_cols),['Age', 'Fare'])
            
    def test_correct_ages(self):
        self.assertEqual(int(self.result_sum.get('Age',0)),388)
        
    def test_correct_fares(self):
        self.assertEqual(int(self.result_sum.get('Fare',0)),360)
        
def ageFare(df,n):
    # Your code goes here: (This is a one-liner!)
    pass

runTest(TestAgeFare)

### Part 2: Sorting
Use the ["DataFrame.sort_values"](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html) method to complete the "getTopFares" function, which should return the *n* highest fares from the DataFrame *df*.
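
As a rough illustration of how "sort_values" behaves, here is a sketch on an invented price list (the `toy` dataframe and its column names are assumptions made up for this example):

```python
import pandas as pd

# Hypothetical toy data, unrelated to the Titanic table.
toy = pd.DataFrame({
    'Item': ['tea', 'cake', 'soup', 'pie'],
    'Price': [2.5, 4.0, 3.0, 6.5],
})

# ascending=False puts the largest values first.
by_price = toy.sort_values('Price', ascending=False)
print(by_price)

# The original row labels travel with the rows; they are not renumbered.
print(list(by_price.index))
```

Note that "sort_values" returns a new dataframe and leaves the original untouched.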

In [None]:
class TestGetTopFares(TestCase):
    """Don't worry too much about how these tests work...
    (...unless you want to.)
    """
    
    def testTopSixFares(self):
        total = int(sum(getTopFares(titanic,6)['Fare']))
        self.assertEqual(total,2325)

def getTopFares(df,n):
    # Your code, a one-liner, here:
    pass

runTest(TestGetTopFares)

It's also quite easy to filter rows according to criteria. We're interested in who survived:

In [None]:
survived = titanic['Survived'] == 1

The "survived" variable is just a series of boolean (true or false) values, one per row:

In [None]:
survived[0:5]

We can now use this list (or *series*) to reference our original dataframe, and only show survivors:

In [None]:
titanic[survived]

We can filter on other conditions. Here are the younger than average passengers, done in a single step:

In [None]:
titanic[titanic['Age'] < 30][0:15]

Use the "~" operator to invert the selection, and return the older passengers:

In [None]:
titanic[~(titanic['Age'] < 30)][0:15]

### Part 3: Bad Data
What's this? Some of the ages of the older passengers are "NaN", or "not a number"! How come "NaN" is greater than 30? Read the documentation for the ["dropna" method](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html) of dataframe objects. Complete the function "onlyDefinedAges" that returns the dataframe *df* with all the rows with NaN ages removed. Everything else should be kept.
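
The short answer is that NaN is not greater than 30: every comparison with NaN is false, so inverting a "less than" mask sweeps the NaN rows in. A minimal sketch on a made-up series of three ages:

```python
import pandas as pd

# Hypothetical ages, with one missing value in the middle.
ages = pd.Series([22.0, float('nan'), 54.0])

younger = ages < 30
print(younger.tolist())       # [True, False, False]

# Inverting the mask turns the NaN row's False into True...
print((~younger).tolist())    # [False, True, True]

# ...whereas asking directly for ">= 30" leaves the NaN row out.
print((ages >= 30).tolist())  # [False, False, True]
```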

In [None]:
class TestOnlyDefinedAges(TestCase):
    """Don't worry too much about how these tests work...
    (...unless you want to.)
    """
    
    def test_all_columns_present(self):
        self.assertEqual(list(titanic.columns),list(onlyDefinedAges(titanic).columns))
    
    def test_passenger_ids(self):
        subset = titanic[0:40]
        ids = list(subset[pd.notnull(subset['Age'])]['PassengerId'])
        self.assertEqual(ids,list(onlyDefinedAges(subset)['PassengerId']))
    

def onlyDefinedAges(df):
    #Your code here. This should be a one-liner!
    pass
    
runTest(TestOnlyDefinedAges)                        

Now there shouldn't be any NaN values in the age column:

In [None]:
onlyDefinedAges(titanic[~(titanic['Age'] < 30)])[0:15]

### Part 4: Filtering
The "SibSp" and "Parch" columns refer to the numbers of siblings or spouses and parents or children travelling with each passenger. We can use the "&" operator to see how many passengers travelled alone.

In [None]:
alone = titanic[(titanic['SibSp'] == 0) & (titanic['Parch'] == 0)]
alone.count()
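
The parentheses around each comparison matter, because "&" binds more tightly than "==" in Python. Naming the masks first is one way to keep things readable. A sketch on an invented four-row dataframe:

```python
import pandas as pd

# Hypothetical toy data: only the row labelled 1 has no family aboard.
toy = pd.DataFrame({
    'SibSp': [1, 0, 0, 2],
    'Parch': [0, 0, 1, 0],
})

no_siblings = toy['SibSp'] == 0
no_parents = toy['Parch'] == 0

# "&" keeps rows where both masks are True; "~" inverts the combined mask.
alone = toy[no_siblings & no_parents]
with_family = toy[~(no_siblings & no_parents)]

print(len(alone), len(with_family))
```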

Your turn again. Complete the function "sexAndSurvival", which takes a dataframe *df* and returns a filtered version according to the boolean parameters *female* and *survivor*, both of which default to "True".

In [None]:
class TestSexSurvival(TestCase):
    """Don't worry too much about how these tests work...
    (...unless you want to.)
    """
            
    def test_dead_and_male(self):
        dead_and_male = sexAndSurvival(titanic,False,False)
        self.assertTrue(all(list(dead_and_male['Sex'] == 'male')))
        self.assertTrue(all(list(dead_and_male['Survived'] == 0)))
        
    def test_dead_and_female(self):
        dead_and_female = sexAndSurvival(titanic,True,False)
        self.assertTrue(all(list(dead_and_female['Sex'] == 'female')))
        self.assertTrue(all(list(dead_and_female['Survived'] == 0)))
        
    def test_alive_and_male(self):
        alive_and_male = sexAndSurvival(titanic,False,True)
        self.assertTrue(all(list(alive_and_male['Sex'] == 'male')))
        self.assertTrue(all(list(alive_and_male['Survived'] == 1)))
        
    def test_alive_and_female(self):
        alive_and_female = sexAndSurvival(titanic,True,True)
        self.assertTrue(all(list(alive_and_female['Sex'] == 'female')))
        self.assertTrue(all(list(alive_and_female['Survived'] == 1)))
        
def sexAndSurvival(df,female=True,survivor=True):
    # Your code here. Maybe half-a-dozen lines?
    pass
    
runTest(TestSexSurvival)

If your function works, you can see that being male really didn't help:

In [None]:
total_female = titanic[titanic['Sex'] == 'female'].count()['PassengerId']
total_male = titanic[titanic['Sex'] == 'male'].count()['PassengerId']
female_survivors = sexAndSurvival(titanic,female=True, survivor=True).count()['Survived']
male_survivors = sexAndSurvival(titanic,female=False, survivor=True).count()['Survived']
print(100 * female_survivors / total_female, 'percent of female passengers survived')
print(100 * male_survivors / total_male, 'percent of male passengers survived')

### Part 5: Grouping and Aggregating
Now we shall explore the very powerful ["groupby"](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html) and ["aggregate"](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.aggregate.html) methods. Here, the dataframe is partitioned according to the passenger class. To get a dataframe back, we need the aggregate method. The set of rows for each partition is passed to a function. You can even use a dictionary to specify a different function for each column. We'll count the passenger IDs to get the number of passengers in each class, then sum the Survived column to get the number of survivors in each class.

In [None]:
survivors_by_class = titanic.groupby('Pclass').aggregate({'PassengerId':'count', 'Survived':'sum'})

In [None]:
survivors_by_class

Now for some more "magic". We can add a column to a dataframe just like adding a value to a dictionary. We can also do operations on entire columns, so if we divide the sum of the survivors by the number of passengers, we get the chance of survival for each class:

In [None]:
survivors_by_class['Survival %'] = 100 * survivors_by_class['Survived'] / survivors_by_class['PassengerId']
survivors_by_class
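
The same pattern works on any table with a grouping column. Here is a sketch on invented data (the `toy` dataframe, its teams and its "Win %" column are made up for illustration):

```python
import pandas as pd

# Hypothetical toy data: five players in two teams, with a 0/1 "Won" flag.
toy = pd.DataFrame({
    'Team': ['red', 'blue', 'red', 'blue', 'red'],
    'Player': [1, 2, 3, 4, 5],
    'Won': [1, 0, 0, 1, 1],
})

# One row per team: count the players, and sum the wins.
per_team = toy.groupby('Team').aggregate({'Player': 'count', 'Won': 'sum'})

# Arithmetic on whole columns works row by row, matched on the index.
per_team['Win %'] = 100 * per_team['Won'] / per_team['Player']
print(per_team)
```

The grouping column ends up as the index of the result, which is why the rows are labelled by team.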

Does being first class help even if you're female? Does being first class mitigate being male?

In [None]:
fem_survivors_by_class = titanic[titanic['Sex'] == 'female'].groupby('Pclass').aggregate(
    {'PassengerId':'count', 'Survived':'sum'})
fem_survivors_by_class['Survival %'] = (100 * fem_survivors_by_class['Survived'] 
    / fem_survivors_by_class['PassengerId'])
fem_survivors_by_class

In [None]:
male_survivors_by_class = titanic[titanic['Sex'] == 'male'].groupby('Pclass').aggregate(
    {'PassengerId':'count', 'Survived':'sum'})
male_survivors_by_class['Survival %'] = (100 * male_survivors_by_class['Survived']
    / male_survivors_by_class['PassengerId'])
male_survivors_by_class

Now we shall experiment with grouping the data into what are technically called *bins*. We'll make a copy of our dataframe in case we mess it up. If you mess it up anyway, just re-run the cell that created the "titanic" dataframe at the top of the notebook.

In [None]:
titanic_fare_bins = titanic.copy()

You are provided with the "bins_and_labels" function. We'll choose bins for the fare prices £50 wide, with edges from £0 to £300. (Or were the fares in dollars?)

In [None]:
def bins_and_labels(maximum,step):
    """Return bin edges of width <step>, from 0-<maximum>, and also labels for the bins.
    """
    bins = list(range(0,maximum+step,step))
    return bins, ['-'.join([str(a), str(b)]) for a,b in zip(bins,bins[1:])]

In [None]:
fare_bins, fare_labels = bins_and_labels(300,50)
print("fare bins",fare_bins)
print("fare labels",fare_labels)

Now we shall employ the [pandas.cut](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.cut.html) function. *fare_ranges* is a series with one entry per row of the "Fare" column, holding the label of the bin that fare falls into. We'll add it to our dataframe as the "Fare Range" column.

In [None]:
fare_ranges = pd.cut(titanic_fare_bins['Fare'],bins=fare_bins, labels=fare_labels)
titanic_fare_bins['Fare Range'] = fare_ranges
titanic_fare_bins[0:5]
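
It helps to know how "cut" treats the edges. With the default settings each bin excludes its left edge and includes its right edge, and anything outside the edges gets no label at all. A sketch with made-up values (the `values`, `edges` and `labels` names are invented for this example):

```python
import pandas as pd

# Hypothetical values chosen to land on and around the bin edges.
values = pd.Series([0.0, 7.25, 50.0, 50.5, 512.0])
edges = [0, 50, 100]
labels = ['0-50', '50-100']

binned = pd.cut(values, bins=edges, labels=labels)

# 0.0 sits on the excluded left edge and 512.0 is beyond the last edge,
# so both come back as missing values.
print(binned.isna().tolist())

# 50.0 sits on an included right edge, so it belongs to the lower bin.
print(binned.iloc[2])
```

So with the fare bins above, any fare of exactly 0, or above the top edge, would end up without a "Fare Range".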

The "groupby" and "aggregate" trick shows that you *really* didn't want a ticket cheaper than £50:

In [None]:
titanic_fare_bins.groupby('Fare Range').aggregate({'PassengerId':'count', 'Survived':'sum'})

Your turn again. Write a function that adds a column called "Age Range" to the dataframe, using 10-year bins from 0-80. You should get rid of NaN ages first; use your existing function if you like. All this should happen in the "addAgeRange" function.

In [None]:
class TestAddAgeRange(TestCase):
    """Don't worry too much about how these tests work.
    Tests are usually a good idea, though.
    """
    
    def setUp(self):
        try:
            self.data = addAgeRange(titanic)
        except:
            self.data = titanic.copy()
            
    def test_columns(self):
        cols = ['PassengerId','Survived','Pclass','Name','Sex','Age','SibSp','Parch','Ticket','Fare','Cabin',
            'Embarked','Age Range']
        self.assertEqual(cols, list(self.data.columns))
        
    def test_30_40_fares(self):
        self.assertEqual(int(sum(self.data[self.data['Age Range'] == '30-40']['Fare'])),6586)
        
    def test_40_50_survival(self):
        self.assertEqual(sum(self.data[self.data['Age Range'] == '40-50']['Survived']),33)
        
    def test_50_60_count(self):
        self.assertEqual(len(self.data[self.data['Age Range'] == '50-60']),42)
        
def addAgeRange(df):
    new_df = df.copy()
    # Your code starts here: (About half-a-dozen-lines.)
    return new_df

runTest(TestAddAgeRange)

Pandas can automatically bin and plot data for us with the [dataframe.hist](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.hist.html) method. The age distributions of the survivors (or otherwise) don't tell us much, though. We need the age distribution of the fraction of survivors, so let's use our last function to do that:

In [None]:
%matplotlib inline
titanic[titanic['Survived'] == 1]['Age'].hist()

In [None]:
%matplotlib inline
titanic[titanic['Survived'] == 0]['Age'].hist()

In [None]:
age_bin_df = addAgeRange(titanic)
survivors_by_age = age_bin_df.groupby('Age Range').aggregate({'PassengerId':'count', 'Survived':'sum'})
survivors_by_age['Survival %'] = 100 * survivors_by_age['Survived'] / survivors_by_age['PassengerId']
survivors_by_age

Women and children first, it would seem. The very young do quite well, passengers from 10-60 have around a 40% chance, then there's a big drop-off.

### Part 6: String Operations
This last bit's a little tougher. We'll introduce the [split method](https://docs.python.org/3/library/stdtypes.html#str.split) of a string:

In [None]:
pangram = "The quick brown fox jumps over the lazy dog"
pangram.split()

We don't have to split on spaces. If we pass in another string (like a comma), we can split on that instead.

In [None]:
lily_may_peel = "Futrelle, Mrs. Jacques Heath (Lily May Peel) "
lily_may_peel.split(',')

Now we introduce the concept of [map](https://docs.python.org/3/library/functions.html#map). Map applies a given function to each element of a series or other *iterable*, which could be a list. Here the "get_uppercase" function is applied to all the words in our pangram, and we turn the results into a list so we can print them. It's a bit clumsy to have a separate function just to call the "upper" method of our strings, so we can use a [lambda expression](https://docs.python.org/3/tutorial/controlflow.html#lambda-expressions) instead. It's a nameless or *anonymous* function we can pass straight in. The effect is the same, so don't use them if you don't want to.

In [None]:
def get_uppercase(s):
    return s.upper()

print(list(map(get_uppercase, pangram.split())))
print(list(map(lambda x: x.upper(), pangram.split())))
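
The reason for wrapping "map" in "list" is that map hands back a lazy iterator, which produces its results only once. A short sketch (the shortened sentence and the variable names are just for illustration):

```python
words = "The quick brown fox".split()

# map returns an iterator; nothing is computed until we consume it.
lazy = map(str.upper, words)
print(lazy)

shouted = list(lazy)   # consuming the iterator gives the results
print(shouted)

# The iterator is now used up, so a second pass yields nothing.
leftover = list(lazy)
print(leftover)

# A list comprehension is another common way to write the same thing.
also_shouted = [word.upper() for word in words]
```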

Now let's put that all together into something useful. The "Name" column of the dataframe is a list of comma-separated terms, usually starting with a surname. (No, not everyone's name works like that.) We shall add a "Surname" column to our dataframe. Here's how:

1. Use the ["Series.str.split"](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.split.html) method to split the "Name" column on commas.
2. Use the ["Series.map"](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.map.html) method to get the first item from each list of comma-separated strings. You can complete the "firstItem" function to do this, or you can use a lambda expression instead.
3. Add this series as a new column called "Surname" to a copy of our dataframe.
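
To get a feel for how the two series methods fit together before using them on names, here is a sketch on an unrelated, made-up series of file paths, where we keep the *last* piece rather than the first:

```python
import pandas as pd

# Hypothetical data: slash-separated paths instead of comma-separated names.
paths = pd.Series(['usr/local/bin', 'etc/hosts', 'home'])

# str.split turns every string into a list of its pieces.
parts = paths.str.split('/')
print(parts.tolist())

# map applies a function to each element, here picking the final piece.
last_part = parts.map(lambda pieces: pieces[-1])
print(last_part.tolist())
```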

In [None]:

class TestAddSurnames(TestCase):
    
    def setUp(self):
        try:
            self.data = addSurnames(titanic)
        except:
            self.data = titanic.copy()
    
    def testColumns(self):
        true_cols = ['PassengerId','Survived','Pclass','Name','Sex','Age','SibSp','Parch','Ticket',
            'Fare','Cabin','Embarked','Surname']
        self.assertEqual(true_cols,list(self.data.columns))
        
    def testQuasiRandomNames(self):
        names = self.data.groupby('Surname').aggregate('count').sort_values('PassengerId',
            ascending=False).reset_index()
        true_names = list(map(lambda i: names.iloc[i, 0],[5,22,9,16,56,48,103]))
        self.assertEqual(true_names, ['Johnson', 'Bourke', 'Fortune', 'Brown', 'Coleff', 'Richards', 'Beckwith'])

def addSurnames(df):
    new_df = df.copy()
    # Your code here: (You can do this by adding three lines.)
    return new_df
 
runTest(TestAddSurnames)

One last bit of messing around with groupby, aggregation and sorting to return the most common surnames on the Titanic. What does ["reset_index"](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reset_index.html) do?

In [None]:
top_names = addSurnames(titanic).groupby('Surname').aggregate('count').sort_values('PassengerId',
    ascending=False).reset_index()[['Surname','PassengerId']]
top_names
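
If you'd like a hint about "reset_index", this sketch on a tiny invented dataframe shows the shape of the table before and after the call:

```python
import pandas as pd

# Hypothetical toy data with a column to group by.
toy = pd.DataFrame({'Colour': ['red', 'blue', 'red'], 'Size': [1, 2, 3]})

# After grouping, "Colour" is the index rather than an ordinary column.
counts = toy.groupby('Colour').aggregate('count')
print(list(counts.columns), counts.index.name)

# reset_index moves it back into a column and numbers the rows from 0.
flat = counts.reset_index()
print(list(flat.columns), list(flat.index))
```

That is why the cell above can select the "Surname" column by name after calling it.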

Run the last two cells to check everything works.

In [None]:
def testAll():
    all_the_tests = [TestAgeFare, TestGetTopFares, TestOnlyDefinedAges, TestSexSurvival, TestAddAgeRange,
        TestAddSurnames]
    total_tests = 0
    total_passes = 0
    runner = TextTestRunner(verbosity=0)
    for test in all_the_tests:
        suite = TestLoader().loadTestsFromTestCase(test)
        count = suite.countTestCases()
        total_tests += count
        result = runner.run(suite)
        total_passes += count - len(result.errors) - len(result.failures)
        print()
    return total_tests, total_passes

tests, passes = testAll()

In [None]:
print('tests run:', tests)
print('tests passed:', passes)