# Lab 5

**This lab must be completed individually.**

Where provided, match the **Sample Output** as closely as you can.

In [24]:
# Before moving forward, let's import these libraries first.
import pandas as pd
import numpy as np

## A. Plotting using Seaborn (and matplotlib) (Total: 2 marks)

`matplotlib` is a Python library for data visualization. 
`seaborn` is a statistical data visualization library built on top of `matplotlib`. It provides a high-level interface for drawing statistical graphics and some convenient functions for plotting data frames.

You may need to install `seaborn` and `matplotlib`:

`conda install seaborn`<br>
`conda install matplotlib`

If your installed versions are not the latest, go ahead and update them:

`conda update matplotlib`<br>
`conda update seaborn`



### A1: Set the Seaborn figure theme and scale up the text in the figures (2 marks)

There are five preset Seaborn themes: `darkgrid`, `whitegrid`, `dark`, `white`, and `ticks`. 
They are each suited to different applications and personal preferences.
You can see what they look like [here](https://seaborn.pydata.org/tutorial/aesthetics.html#seaborn-figure-styles).

Hint: You will need to use the `font_scale` parameter of the `set_theme()` function in Seaborn.
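
As a rough sketch of how `font_scale` behaves, the snippet below sets a theme and compares the resulting font size with the unscaled default. The style name `whitegrid` and the value `1.5` are arbitrary illustrative choices, not required answers.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# "whitegrid" and 1.5 are arbitrary illustrative choices
sns.set_theme(style="whitegrid", font_scale=1.5)

# font_scale multiplies the base text sizes of the plotting context
base_size = sns.plotting_context("notebook")["font.size"]
scaled_size = plt.rcParams["font.size"]
print(base_size, scaled_size)
```

The theme applies to every figure drawn afterwards, so it only needs to be set once near the top of the notebook.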

In [25]:
import matplotlib.pyplot as plt
import seaborn as sns

# Your solution here

## B: Exploratory Data Analysis (34 marks)

For the following part of the lab, we're going to use a dataset from [Kaggle](https://www.kaggle.com/agirlcoding/all-space-missions-from-1957).

### B1. Describe the dataset (2 marks) 

Consider the following questions to guide you in your exploration:

- Who: Which company/agency/organization provided this data?
- What: What is in your data?
- When: When was your data collected (for example, for which years)?
- Why: What is the purpose of your dataset? Is it for transparency/accountability, public interest, fun, learning, etc...
- How: How was your data collected? Was it a human collecting the data? Historical records digitized? Server logs?


#your solution here

*Hint: You probably will not need more than 250 words to describe your dataset. You do not need to answer every question above; they are there to guide your exploration and get you thinking about the context of your data. You may not know the answers to some of them, and that is FINE: data scientists are often faced with the challenge of analyzing data from unknown sources. Do your best, and acknowledge the limitations of your data as well as of your understanding of it. Also, make it clear what you're speculating about. For example, "I speculate that the {...column_name...} column must be related to {....} because {....}."*

### B2. Load data (1 mark)

Without downloading the csv file to your repo, load the "*Space_Cleaned.csv*" file using the direct URL from [this link](https://gist.githubusercontent.com/lintonylin/4f9ba13dc37b7510ea392d95c494f891/raw/1092dba2c54ed10d03f2999d8ad7878757b39a8f/Space_Cleaned.csv)

**DO NOT DOWNLOAD THE DATA TO YOUR REPOSITORY!**
Open the link, copy it and pass it to `read_csv()`.

Use the `pandas` package and its `read_csv()` function to load the data by passing in the URL, and save the result in a variable called `df`.
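
`read_csv()` accepts a local path, a URL string, or a file-like object in the same argument position. The sketch below uses an in-memory buffer as a stand-in for a remote file so that it runs offline; the CSV text and the name `df_demo` are hypothetical.

```python
import io

import pandas as pd

# Hypothetical stand-in for the text a remote CSV file would contain
csv_text = "Company Name,Status Mission\nSpaceX,Success\nNASA,Failure\n"

# read_csv takes a local path, a URL string, or a file-like object
df_demo = pd.read_csv(io.StringIO(csv_text))
print(df_demo.shape)
```

Passing a URL string instead of the buffer works the same way and leaves no file in the repository.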

In [38]:
#your solution here

### B3. Explore your dataset (3 marks)

Which of your columns are interesting/relevant? Remember to take some notes on your observations, you'll need them for the next EDA step (initial thoughts).

#### B3.1:  You should start with [`df.describe().T`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html) (2 marks)

See the [linked documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html) for the use of `include`/`exclude` to look at numerical and categorical data.
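
A minimal sketch of the difference, on a hypothetical toy frame (the column names `company` and `cost` are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one text column and one numeric column
toy = pd.DataFrame(
    {
        "company": ["A", "B", "A", "C"],
        "cost": [50.0, 29.75, 65.0, 145.0],
    }
)

# Default: numeric columns only; .T puts one column per row
numeric_summary = toy.describe().T

# exclude=[np.number] summarises the non-numeric columns instead
categorical_summary = toy.describe(exclude=[np.number]).T

print(numeric_summary)
print(categorical_summary)
```

The numeric summary reports statistics such as the mean and quartiles, while the non-numeric summary reports `count`, `unique`, `top`, and `freq`.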

In [39]:
#your solution here

In [40]:
#your solution here

#### B3.2 Let's try `pandas_profiling` now. (1 mark)

**Hint: To install the [`pandas_profiling`](https://towardsdatascience.com/exploratory-data-analysis-with-pandas-profiling-de3aae2ddff3) package, you'll need to use `conda`:**

- `conda install -c conda-forge pandas-profiling`

In [41]:
import pandas_profiling as pdp
#your solution here

### B4. Initial Thoughts (2 marks)

#### B4.1. Use this section to record your observations. (2 marks)

Does anything jump out at you as surprising or particularly interesting? 

Where do you think you'll go with exploring this dataset? Feel free to take notes in this section and use it as a scratch pad.

Any content in this area will only be marked for effort and completeness.

#### # Your observations here:

- Obs 1
- Obs 2
- ...

### B5. Wrangling (10 marks)

The next step is to wrangle your data based on your initial explorations. Normally, by this point, you have some idea of what your research question will be, and that will help you narrow and focus your dataset. 

In this lab, we will guide you through some wrangling tasks with this dataset.

#### B5.1 Rename the column `Rocket` to `Mission Cost` and save the result. (The column name has one leading space, so use `' Rocket'`.) (1 mark)
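
A small sketch of renaming a column with a leading space, using a hypothetical two-row frame rather than the lab dataset:

```python
import pandas as pd

# Hypothetical toy frame whose second column name starts with a space
toy = pd.DataFrame({"Company Name": ["A", "B"], " Rocket": ["50.0", "29.75"]})

# rename returns a new DataFrame; assign it back to keep the change
toy = toy.rename(columns={" Rocket": "Mission Cost"})
print(list(toy.columns))
```

If the key in the mapping does not match the existing name exactly (including the space), `rename` silently leaves the column unchanged by default.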

In [36]:
#your solution here

#### B5.2 Drop any NULL values, if there are any. (Keep in mind that whether or not you save the dataset without nulls will affect your future plots.) (1 mark)
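
The sketch below, on hypothetical toy data, shows that `dropna()` returns a new frame and leaves the original untouched unless you reassign it:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value
toy = pd.DataFrame({"Company Name": ["A", "B", "C"], "Mission Cost": [50.0, np.nan, 65.0]})

dropped = toy.dropna()  # new frame without the row containing NaN
print(len(toy), len(dropped))  # the original is unchanged unless reassigned
print(list(dropped.index))  # index labels keep the gap left by the dropped row
```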

In [42]:
#your solution here

#### B5.3 Reset the index to get a new index without missing values (1 mark)

In [None]:
#your solution here

#### B5.4. A new column was added called `index`; remove it. (1 mark)
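
To illustrate where the `index` column comes from, here is a sketch on a hypothetical frame whose index has a gap. The variable names are illustrative only.

```python
import pandas as pd

# Hypothetical frame whose index has a gap, as after dropping rows
toy = pd.DataFrame({"Company Name": ["A", "C"]}, index=[0, 2])

with_old_index = toy.reset_index()  # old labels kept in a new "index" column
without_it = with_old_index.drop(columns="index")  # one way to remove that column
direct = toy.reset_index(drop=True)  # alternative: never create the column

print(list(with_old_index.columns))
```

Both routes end with the same frame; `drop=True` simply skips the intermediate column.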

In [None]:
#your solution here

#### B5.5 Sort the dataframe by the `Company Name` column. (1 mark)
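
A minimal sorting sketch on hypothetical toy rows (the column `n` and its values are made up):

```python
import pandas as pd

# Hypothetical toy rows in arbitrary order
toy = pd.DataFrame({"Company Name": ["SpaceX", "Arianespace", "NASA"], "n": [3, 1, 2]})

# sort_values returns a new frame sorted alphabetically by the given column
sorted_toy = toy.sort_values("Company Name")
print(list(sorted_toy["Company Name"]))
```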

In [None]:
#your solution here

#### B5.6 Add a new column to the dataframe to convert the "Datum" column to a datetime object (2 marks)


To do this, first we need to add a new column to our dataset to turn the column "Datum" into a proper datetime object so we can do operations on it.

*Hint: Use the [`to_datetime()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html) function to first convert the column into datetime objects, and then remove the timezone information and HH:MM:SS using [`.dt.date`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.dt.date.html).*
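
The two steps in the hint can be sketched as follows. The date strings here are hypothetical and the real `Datum` values may be formatted differently; the new column name `date` is also just an illustrative choice.

```python
import datetime

import pandas as pd

# Hypothetical date strings; the real "Datum" values may be formatted differently
toy = pd.DataFrame({"Datum": ["2020-08-07 05:12:00+00:00", "1957-10-04 19:28:00+00:00"]})

# utc=True yields timezone-aware timestamps; .dt.date keeps only the calendar date
toy["date"] = pd.to_datetime(toy["Datum"], utc=True).dt.date
print(toy["date"].tolist())
```

Note that `.dt.date` produces plain Python `datetime.date` objects, so the resulting column has `object` dtype.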

In [None]:
#your solution here

#### B5.7. Find the earliest and the latest reported launch in the dataset. (3 marks)

You should use the pandas `.min()` and `.max()` functions here, now that your date string is converted to a datetime object.
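
As a sketch with made-up dates, `.min()` and `.max()` work directly on a Series of date objects, and an f-string can format the result:

```python
import datetime

import pandas as pd

# Hypothetical dates, unrelated to the lab dataset
dates = pd.Series(
    [datetime.date(2001, 5, 1), datetime.date(1999, 1, 15), datetime.date(2010, 12, 31)]
)

first, last = dates.min(), dates.max()
print(f"Earliest: {first}, latest: {last}")
```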

##### Sample Output

> The first launch in the dataset happened : 1957-10-04.<br>
> The last launch in the dataset happened : 2020-08-07.<br>

### B6. Research questions (2 marks)

#### B6.1 Come up with at least two research questions about the dataset that will require data visualizations to help answer. (2 marks)

Recall that for this purpose, you should only aim for "Descriptive" or "Exploratory" research questions.

**Hint: You are welcome to calculate any columns that you think might be useful to answer the question (or re-add dropped columns).**


#### # Your solution here: 

**1. Sample Research Question:** Which company has the highest mission cost?

**2. Your RQ 1:**

**3. Your RQ 2:**



### B7. Data Analysis and Visualizations (10 marks)

#### B7.1. Counts of mission status (2 marks)
Using [`sns.countplot()`](https://seaborn.pydata.org/generated/seaborn.countplot.html?highlight=countplot#seaborn.countplot), plot the number of space launches by their status. 

Set the title to be "Status Mission of space launches". 

*Hint: The documentation above contains some examples that might help you get started*
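
A minimal `countplot()` sketch on hypothetical toy data; the column name `status` and the title text are illustrative, not the ones required above:

```python
import pandas as pd
import seaborn as sns

# Hypothetical toy data; "status" is an illustrative column name
toy = pd.DataFrame({"status": ["Success", "Failure", "Success", "Success"]})

ax = sns.countplot(data=toy, x="status")  # one bar per category, height = row count
ax.set_title("Toy status counts")  # countplot returns a matplotlib Axes
```

Because the return value is a regular matplotlib `Axes`, any of its methods (titles, labels, limits) can be used to adjust the figure.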

#### Sample output
<img src="./images/bar1.png" width="400px" />

In [None]:
#your solution here

#### B7.2. Counts of launches by country (2 marks)

Plot the counts of launches by country, and order the y-axis by increasing number of launches (use the `order` parameter of the `countplot()` function).
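
One way to build the list for `order` is from `value_counts()`. The sketch below assumes a hypothetical `country` column with made-up rows:

```python
import pandas as pd
import seaborn as sns

# Hypothetical toy data; the counts are made up
toy = pd.DataFrame({"country": ["USA", "Russia", "USA", "China", "USA", "Russia"]})

# value_counts(ascending=True) sorts categories from fewest to most rows
order = toy["country"].value_counts(ascending=True).index.tolist()

# y= draws horizontal bars; order= fixes the category order along that axis
ax = sns.countplot(data=toy, y="country", order=order)
print(order)
```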

<img src="./images/bar2.png" width="700px">

In [None]:
#your solution here

#### B7.3 Status of mission for the top 5 companies (3 marks)

Plot the counts of launches by mission status for the top 5 companies, and order the y-axis by increasing number of launches.

*Hint: More information and examples can be found [here](https://www.geeksforgeeks.org/matplotlib-axes-axes-barh-in-python/).*

*Hint: Your plot doesn't have to look exactly like this, but please do explore the [possible color palettes](https://seaborn.pydata.org/tutorial/color_palettes.html). You can specify the colour palette by passing in the keyword like this: `palette='colorblind'`.*
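
A sketch of combining a "top N" filter with `hue` and `palette`, on hypothetical toy data. It keeps the top 2 rather than the top 5 only to keep the example small; the column names are made up.

```python
import pandas as pd
import seaborn as sns

# Hypothetical toy data
toy = pd.DataFrame(
    {
        "company": ["A", "A", "A", "B", "B", "C"],
        "status": ["Success", "Success", "Failure", "Success", "Failure", "Success"],
    }
)

# Names of the 2 most frequent companies (2 keeps the toy small)
top = toy["company"].value_counts().head(2).index.tolist()
subset = toy[toy["company"].isin(top)]

# hue= splits each bar by a second categorical column
ax = sns.countplot(
    data=subset,
    y="company",
    hue="status",
    order=top[::-1],  # reversed so counts increase along the axis
    palette="colorblind",
)
```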

#### Sample output
<img src="./images/bar3.png" width="600px" />

In [45]:
#your solution

#### B7.4. Plot the launch counts over time by mission status (3 marks)

Using `sns.displot`, plot the histogram of launches over time.

*Hint 1: [Here is a nice tutorial](https://seaborn.pydata.org/tutorial/distributions.html) of all the different options that are possible when creating a histogram.*

#### Sample output
<img src="./images/bar4.png" width="800px">

In [None]:
#your solution here

#### B7.5. BONUS - For a bonus mark, move the legend to the top left of the plot (1 mark)

In [None]:
#your solution here

#### B7.6. BONUS - For a bonus mark, plot a graph similar to D7 that compares the top 5 countries in terms of number of launches (1 mark)



In [None]:
#your solution here

### B8. Summary and conclusions (4 marks)

#### B8.1. Summarize your findings and describe any conclusions and insight you were able to draw from your visualizations. (3 marks)

- **Research Question 1:** RQ here

    - Summary of findings, insight, and conclusions
    - ..
    

- **Research Question 2:** RQ here

    - Summary of findings, insight, and conclusions
    - ..

## C. Method Chaining (6 marks)

Method chaining allows you to apply multiple processing steps to your dataframe in fewer lines of code, which makes it more readable. You should avoid having too many methods in your chain, as the more you have in a single chain, the harder it is to debug or troubleshoot. I would target about 5 methods in a chain, though this is a flexible suggestion: do what makes your analysis the most readable, and group your chains based on their purpose (e.g., loading/cleaning, processing, etc.).
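
A minimal sketch of a five-step chain on a hypothetical toy frame (all column names here are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a badly named column and one missing value
raw = pd.DataFrame(
    {
        "name": ["b", "a", "c"],
        " value": [2.0, np.nan, 1.0],
    }
)

cleaned = (
    raw.rename(columns={" value": "value"})  # fix a column name
    .dropna()  # drop rows with missing values
    .sort_values("name")  # order rows
    .reset_index(drop=True)  # renumber without keeping the old index
    .assign(double=lambda d: d["value"] * 2)  # derive a new column
)
print(cleaned)
```

Wrapping the chain in parentheses lets each method sit on its own line with a comment, which makes it easy to comment out one step while debugging.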

#### C1. Use Method Chaining on the commands from sections B5.1, B5.2, B5.3, B5.4, B5.5, B5.6 (6 marks)

In [17]:
# Your Solution here