### House Prices

We have data for the house prices on one road, and we want to know the price per bedroom for these houses.

In [1]:
import pandas as pd

houses = pd.DataFrame([
    [3, 2, 1945, "£445,000"],
    [2, 1, 1975, "£300,000"],
    [1, 1, 2013, "£240,000"],
    [2, None, 1945, "£314,000"],
    [1, 1, 2011, "£265,000"],
    [3, None, 1992, "£384,000"],
    [4, 2, 1986, "£498,000"],
])

houses

Unnamed: 0,0,1,2,3
0,3,2.0,1945,"£445,000"
1,2,1.0,1975,"£300,000"
2,1,1.0,2013,"£240,000"
3,2,,1945,"£314,000"
4,1,1.0,2011,"£265,000"
5,3,,1992,"£384,000"
6,4,2.0,1986,"£498,000"


Rename the columns to `"Bedrooms"`, `"Bathrooms"`, `"Year Built"` and `"Price"`.

In [2]:
houses = houses.rename(columns={0:'Bedrooms', 1:'Bathrooms', 2: 'Year Built', 3: 'Price'})

Two of the houses are missing a value for the number of bathrooms.

Fill each missing value with that house's number of bedrooms instead.

In [3]:
houses['Bathrooms'] = houses['Bathrooms'].fillna(houses['Bedrooms'])

Write a function that removes the `"£"` and `","` from a price string and returns the price as a number instead of a string. Map this function over the `"Price"` column.

In [4]:
def clean_price(price_str):
    price_str = price_str.lstrip('£')
    price_str = price_str.replace(',', '')
    return int(price_str)

houses['Price'] = houses['Price'].map(clean_price)
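
As an alternative to mapping a Python function, the same cleanup can probably be done with pandas' vectorised string methods. This is only a sketch on a hypothetical three-value `prices` series, not the exercise's intended solution.

```python
import pandas as pd

# Hypothetical sample of price strings in the same format as the table
prices = pd.Series(["£445,000", "£300,000", "£240,000"])

# regex=False treats "£" and "," as literal characters
cleaned = (
    prices.str.replace("£", "", regex=False)
          .str.replace(",", "", regex=False)
          .astype(int)
)

print(cleaned.tolist())  # [445000, 300000, 240000]
```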

What is the average price per bedroom for a house on this road?

In [5]:
(houses['Price'] / houses['Bedrooms']).mean()

173261.90476190479
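
The figure above is the mean of each house's own price-per-bedroom ratio. A different reading of the question, total price divided by total bedrooms, gives a lower number because it weights larger houses more heavily. The sketch below compares the two on a hypothetical `toy` frame that copies the cleaned values from this section.

```python
import pandas as pd

# Hypothetical copy of the cleaned bedroom counts and prices
toy = pd.DataFrame({
    "Bedrooms": [3, 2, 1, 2, 1, 3, 4],
    "Price": [445000, 300000, 240000, 314000, 265000, 384000, 498000],
})

# Mean of per-house ratios (what the cell above computes)
mean_of_ratios = (toy["Price"] / toy["Bedrooms"]).mean()

# Ratio of totals: 2,446,000 / 16 bedrooms
ratio_of_totals = toy["Price"].sum() / toy["Bedrooms"].sum()

print(round(mean_of_ratios, 2))  # 173261.9
print(ratio_of_totals)           # 152875.0
```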

### Superstore Returns

Load the superstore data we used in the previous section by running the cell below:

In [6]:
superstore_sales = pd.read_csv('data/superstore_sales.csv')
# show top 5 rows
superstore_sales.head()

Unnamed: 0,Row ID,Order ID,Order Date,Order Priority,Order Quantity,Sales,Discount,Ship Mode,Profit,Unit Price,...,Customer Name,Province,Region,Customer Segment,Product Category,Product Sub-Category,Product Name,Product Container,Product Base Margin,Ship Date
0,1,3,13/10/2010,Low,6,261.54,0.04,Regular Air,-213.25,38.94,...,Muhammed MacIntyre,Nunavut,Nunavut,Small Business,Office Supplies,Storage & Organization,"Eldon Base for stackable storage shelf, platinum",Large Box,0.8,20/10/2010
1,49,293,01/10/2012,High,49,10123.02,0.07,Delivery Truck,457.81,208.16,...,Barry French,Nunavut,Nunavut,Consumer,Office Supplies,Appliances,"1.7 Cubic Foot Compact ""Cube"" Office Refrigera...",Jumbo Drum,0.58,02/10/2012
2,50,293,01/10/2012,High,27,244.57,0.01,Regular Air,46.71,8.69,...,Barry French,Nunavut,Nunavut,Consumer,Office Supplies,Binders and Binder Accessories,"Cardinal Slant-D® Ring Binder, Heavy Gauge Vinyl",Small Box,0.39,03/10/2012
3,80,483,10/07/2011,High,30,4965.7595,0.08,Regular Air,1198.97,195.99,...,Clay Rozendal,Nunavut,Nunavut,Corporate,Technology,Telephones and Communication,R380,Small Box,0.58,12/07/2011
4,85,515,28/08/2010,Not Specified,19,394.27,0.08,Regular Air,30.94,21.78,...,Carlos Soltero,Nunavut,Nunavut,Consumer,Office Supplies,Appliances,Holmes HEPA Air Purifier,Medium Box,0.5,30/08/2010


From this data frame, select only the `"Row ID"`, `"Order ID"`, `"Shipping Cost"` and `"Profit"` columns.

In [7]:
superstore_sales = superstore_sales[['Row ID', 'Order ID', 'Shipping Cost', 'Profit']]

We will now load a second data frame containing the returned orders for this superstore.

In [8]:
superstore_returns = pd.read_csv('data/superstore_returns.csv')
superstore_returns.head()

Unnamed: 0,Order ID,Status
0,65,Returned
1,69,Returned
2,134,Returned
3,135,Returned
4,230,Returned


Merge the two data frames together using an outer join on the column `"Order ID"`.

Show the first 5 rows of the merged dataframe using the `head()` method.

In [9]:
merged_df = pd.merge(superstore_sales, superstore_returns, how='outer', on='Order ID')

merged_df.head()

Unnamed: 0,Row ID,Order ID,Shipping Cost,Profit,Status
0,1,3,35.0,-213.25,
1,49,293,68.02,457.81,
2,50,293,2.99,46.71,
3,80,483,3.99,1198.97,
4,85,515,5.94,30.94,
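
An outer join keeps rows from both sides, so a return with no matching sale would also appear, with nulls in the sales columns. A minimal sketch of how to inspect this uses the `indicator=True` option of `pd.merge`; the `sales` and `returns` frames and their values below are invented for illustration.

```python
import pandas as pd

# Invented toy data: order 999 appears only in the returns table
sales = pd.DataFrame({"Order ID": [3, 65, 293],
                      "Profit": [-213.25, 10.0, 457.81]})
returns = pd.DataFrame({"Order ID": [65, 999],
                        "Status": ["Returned", "Returned"]})

# indicator=True adds a "_merge" column saying where each row came from
merged = pd.merge(sales, returns, how="outer", on="Order ID", indicator=True)
origin_counts = merged["_merge"].astype(str).value_counts().to_dict()

print(origin_counts)
```

If `right_only` rows turn up in the real data, a left join (`how='left'`) might be the safer choice, since those rows carry no profit or shipping figures.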


Note that the `"Status"` column is mostly missing, because it is only filled in for the rows whose orders were returned.

Fill the null `Status` values with the string `"Not Returned"`.

In [10]:
merged_df['Status'] = merged_df['Status'].fillna('Not Returned')

We want to calculate a more accurate profit figure, given that we know which orders were returned.

The company doesn't pass on shipping costs to its customers, so it still bears the shipping cost of an order that is returned.

Therefore, to calculate a more accurate profit value:
* Sum up only the profit from the `"Not Returned"` rows
* Take away the `"Shipping Cost"` of the `"Returned"` rows

In [11]:
profits = merged_df.loc[merged_df['Status'] == 'Not Returned', 'Profit'].sum()
shipping_costs = merged_df.loc[merged_df['Status'] == 'Returned', 'Shipping Cost'].sum()

profits - shipping_costs

1311912.54
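
To make the arithmetic easy to check by hand, here is the same calculation on a tiny invented data set. The `toy` frame and its numbers are hypothetical; only the column names match the merged data frame.

```python
import pandas as pd

# Invented toy data: two kept orders, two returned orders
toy = pd.DataFrame({
    "Status": ["Not Returned", "Returned", "Not Returned", "Returned"],
    "Profit": [100.0, 50.0, -20.0, 30.0],
    "Shipping Cost": [5.0, 7.5, 2.0, 2.5],
})

returned = toy["Status"] == "Returned"

kept_profit = toy.loc[~returned, "Profit"].sum()          # 100 - 20 = 80
lost_shipping = toy.loc[returned, "Shipping Cost"].sum()  # 7.5 + 2.5 = 10

adjusted_profit = kept_profit - lost_shipping
print(adjusted_profit)  # 70.0
```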

### Cleaning Strings

Some people filled in a form with their first names, last names and heights.

However, the height field was a free-text field with no error checking. Your job is to clean up these heights.

All the heights are given in metric units, but the string formats differ.

Run the cell below to see the table.

In [12]:
heights = pd.DataFrame([
    {"first_name": "Stephanie", "last_name": "Bambery", "height": "1.60 metres"},
    {"first_name": "Barnard", "last_name": "Darbey", "height": '1m 80cm'},
    {"first_name": "Gale", "last_name": "Blind", "height": "154"},
    {"first_name": "Corry", "last_name": "Erbe", "height": "1 meter 80"},
    {"first_name": "Godard", "last_name": "Haslam", "height": "170 cm"},
    {"first_name": "Raimundo", "last_name": "Pelman", "height": "172 cm"}, 
    {"first_name": "Evvie", "last_name": "Rathke", "height": "165"},
    {"first_name": "Darius", "last_name": "Hymers", "height": "1.67 m"}
], columns=['first_name', 'last_name', 'height'])

heights

Unnamed: 0,first_name,last_name,height
0,Stephanie,Bambery,1.60 metres
1,Barnard,Darbey,1m 80cm
2,Gale,Blind,154
3,Corry,Erbe,1 meter 80
4,Godard,Haslam,170 cm
5,Raimundo,Pelman,172 cm
6,Evvie,Rathke,165
7,Darius,Hymers,1.67 m


They are currently strings but you would like them as floats in centimetres.

Write a function that will read in all of these height strings, convert them to centimetres, strip out any non-digit characters and return a float.

_Hint: look back at the string transformation functions covered in the presentation._
_This is not an easy problem, so you may want to look at the solution for help._

In [13]:
def digits_from_string_as_float(input_str):
    """
    Return a float built from the digits and decimal points in a string,
    without using any imports.
    """
    # Keep digits and '.' only, so "1m 80cm" becomes "180"
    list_of_digits = [x for x in input_str if x.isdigit() or x == '.']
    digit_str = ''.join(list_of_digits)
    return float(digit_str)


def clean_height_str(height_str):
    # Values ending in "metres", or in "m" but not "cm", are in metres
    is_metres = height_str.endswith('metres') or (
        height_str.endswith('m') and not height_str.endswith('cm')
    )
    if is_metres:
        return digits_from_string_as_float(height_str) * 100
    # Everything else reads as centimetres once the digits are joined
    return digits_from_string_as_float(height_str)

heights['height'] = heights['height'].map(clean_height_str)

In [14]:
heights

Unnamed: 0,first_name,last_name,height
0,Stephanie,Bambery,160.0
1,Barnard,Darbey,180.0
2,Gale,Blind,154.0
3,Corry,Erbe,180.0
4,Godard,Haslam,170.0
5,Raimundo,Pelman,172.0
6,Evvie,Rathke,165.0
7,Darius,Hymers,167.0
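
Joining every digit in the string works for the eight values here, but it relies on the centimetre part always having two digits: `"1m 8cm"` would become `"18"` and be read as 18 cm. The sketch below is one possible stricter variant that swaps in regular expressions (so it does use an import). The function name `parse_height_cm`, the accepted unit spellings, and the reading of `"1 meter 80"` as 1 m plus 80 cm are all assumptions.

```python
import re

# Hypothetical patterns; the accepted unit spellings are an assumption
UNIT_M = r"(?:metres?|meters?|m)"
METRES_AND_CM = re.compile(rf"^\s*(\d+)\s*{UNIT_M}\s*(\d+)\s*(?:cm)?\s*$")
DECIMAL_METRES = re.compile(rf"^\s*(\d+(?:\.\d+)?)\s*{UNIT_M}\s*$")
CENTIMETRES = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(?:cm)?\s*$")


def parse_height_cm(height_str):
    """Return a height in centimetres, or raise ValueError if unrecognised."""
    # Whole metres followed by centimetres, e.g. "1m 80cm" or "1 meter 80"
    match = METRES_AND_CM.match(height_str)
    if match:
        return float(int(match.group(1)) * 100 + int(match.group(2)))
    # A number followed by a metre unit, e.g. "1.60 metres" or "1.67 m"
    match = DECIMAL_METRES.match(height_str)
    if match:
        return round(float(match.group(1)) * 100, 2)
    # A bare number or a number followed by "cm", e.g. "154" or "170 cm"
    match = CENTIMETRES.match(height_str)
    if match:
        return float(match.group(1))
    raise ValueError(f"Unrecognised height: {height_str!r}")


samples = ["1.60 metres", "1m 80cm", "154", "1 meter 80",
           "170 cm", "172 cm", "165", "1.67 m"]
parsed = [parse_height_cm(s) for s in samples]
print(parsed)
print(parse_height_cm("1m 8cm"))  # 108.0
```

Raising `ValueError` on anything unrecognised is a design choice: it surfaces unexpected formats instead of silently producing a wrong number.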
