# Assignment report template for ITNPBD2
This Jupyter Notebook file contains each of the tasks that you need to attempt for the Assignment in ITNPBD2.

You should have already read the instructions on the Canvas page https://canvas.stir.ac.uk/courses/11055/assignments/80892  
If not, __please do so before continuing__.

Read each question/task carefully before providing your answer and solution.

# 1) Crossing a road at an angle?
Look at the attached picture. If you were walking from point A to point B, which is L meters further down the street, it is obviously shorter to walk in a straight line between the two points (blue line) than to cross the road first (via the crosswalk) and then walk alongside the road for L meters (red lines).

Your task is to write a script that accomplishes the following:
 1. Calculates the distance that you save by crossing at an angle for any given distance L
 2. Calculates the additional distance travelled on the road by crossing at an angle.
 3. Demonstrates on a graph the upper bound of the distance saved by increasing L
 4. Finds the length L, to 2 decimal places, where the distance saved no longer exceeds the additional distance travelled on the road. That is, the length L at which you risk more than you gain.
 5. Demonstrates point 4 on a graph 

You should assume that both sides of the road are straight, infinitely long, and parallel.

![crossing_street.jpg](attachment:crossing_street.jpg)
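
The geometry behind tasks 1, 2 and 4 can be sketched as follows. This is only one reading of the problem: the road width `W` is a hypothetical parameter (it is not stated in the text and would have to be taken from the picture), the red route is assumed to be `W + L`, the blue route is the hypotenuse, and the "additional distance on the road" is assumed to be the hypotenuse minus the straight crossing `W`.

```python
import math

def distance_saved(L, W):
    """Red route (W + L) minus blue route (the hypotenuse)."""
    return W + L - math.hypot(W, L)

def extra_on_road(L, W):
    """Assumed meaning: diagonal crossing minus the straight crossing W."""
    return math.hypot(W, L) - W

def break_even_length(W, tol=1e-4):
    """Bisection for the L > 0 where the saving equals the extra road distance."""
    def gain(L):
        return distance_saved(L, W) - extra_on_road(L, W)

    lo, hi = 1e-3 * W, 10.0 * W   # gain(lo) > 0 and gain(hi) < 0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if gain(mid) > 0:
            lo = mid
        else:
            hi = mid
    return round((lo + hi) / 2, 2)

W = 3.0  # hypothetical road width in meters
print(distance_saved(4.0, W), extra_on_road(4.0, W), break_even_length(W))
```

Under these assumptions the saving approaches `W` as L grows, and the break-even point has the closed form L = 4W/3, which the bisection result can be checked against.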

In [None]:
#  Write your  code  and  comments  here  below 

# 2) Fetch the data
Use requests _get_ to load the data from http://cs.stir.ac.uk/~soh/BD2spring2022/assignmentdata.php into an XML tree, then:
- extract the root element as a separate variable, and display the root tag.
- extract the two children of the root element into another two separate variables, and display their tags as well.

You will need to provide a single parameter to the get request; it has been provided in the code cell below for your convenience, as has the URL.

Use Python to accomplish every step of this, i.e. __do not__ manually save the data into a file and then read the file with open() or something equivalent.

Name the tree variable *tree*, the root element *root*, and the children elements *tweets_branch* and *cities_branch*
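
A minimal sketch of the fetch-and-parse pattern, assuming the server returns a well-formed XML document. The helper name `fetch_tree` and the sample tag names are hypothetical; the real response may use different tags, and the offline sample only stands in for the downloaded bytes.

```python
import xml.etree.ElementTree as ET

def fetch_tree(url, params):
    """Download XML over HTTP and wrap it in an ElementTree (not called here)."""
    import requests  # third-party; imported lazily so the demo below runs offline
    response = requests.get(url, params=params, timeout=30)
    response.raise_for_status()  # fail loudly on HTTP errors
    return ET.ElementTree(ET.fromstring(response.content))

# Invented stand-in for the server response; real tag names may differ.
sample = b"<data><tweets/><cities/></data>"
tree = ET.ElementTree(ET.fromstring(sample))
root = tree.getroot()
tweets_branch, cities_branch = list(root)  # assumes exactly two children, in this order
print(root.tag, tweets_branch.tag, cities_branch.tag)
```

With network access, `tree = fetch_tree(url, params)` would replace the sample lines, using the `url` and `params` from the cell below.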


In [None]:
import xml.etree.ElementTree as ET
import requests
url = "http://cs.stir.ac.uk/~soh/BD2spring2022/assignmentdata.php"
params = {'data':'sputvws'}
#  Write your  code  and  comments  here  below 

# 3) Separate the two branches into two lists of dictionaries
Create two variables __cities__ and __tweets__ that contain each main branch of the XML tree as lists of dictionaries.  
Make sure that the data values are of appropriate types.  
Print out the field names and values (keys and values) of one city and one tweet.

*hint: latitude and longitude might be best kept as strings*  
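
One way to turn each branch into a list of dictionaries is sketched below. The sample XML, the field names, and the `INT_FIELDS` set are all invented for illustration; the actual data may be structured differently, and `int()` would raise on values that are not clean integers.

```python
import xml.etree.ElementTree as ET

# Invented sample; the real tag and field names may differ.
SAMPLE = """
<data>
  <tweets>
    <person id="12345678-0001"><name>Ann</name><age>34</age><tweet>Hello</tweet></person>
    <person id="12345678-0002"><name>Bob</name><age></age><tweet>Hi</tweet></person>
  </tweets>
  <cities>
    <city><name>Stirling</name><population>1000</population><latitude>56.12</latitude></city>
  </cities>
</data>
"""

INT_FIELDS = {"age", "population"}  # hypothetical names of integer-valued fields

def element_to_dict(elem):
    """Attributes plus child tags become keys; empty text becomes None."""
    record = dict(elem.attrib)
    for child in elem:
        text = (child.text or "").strip()
        if not text:
            record[child.tag] = None
        elif child.tag in INT_FIELDS:
            record[child.tag] = int(text)
        else:
            record[child.tag] = text  # latitude/longitude stay as strings
    return record

root = ET.fromstring(SAMPLE)
tweets_branch, cities_branch = list(root)
tweets = [element_to_dict(e) for e in tweets_branch]
cities = [element_to_dict(e) for e in cities_branch]

for key, value in cities[0].items():
    print(key, repr(value))
```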

In [None]:
#  Write your  code  and  comments  here  below 

# 4) How many unique City - Country pairs exist in the data?
Find out how many different locations are represented in the twitter data with City and Country pairs.  
Does the tweet data contain more, fewer, or the same number of pairs as the cities branch?

- Print out the 10 most populated cities (largest)
- Print out the __number__ of unique pairs and show that they match the number of cities in the cities branch.

You can either use ElementTree methods on the XML tree itself or work with the list of dictionaries variables in addition to any looping and built-in functionality you see fit.  
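
Working from the lists of dictionaries, unique pairs can be collected in a set of tuples. The sketch below uses invented records and assumes the keys are called `city`, `country`, and `population`; records with a missing city or country are skipped here, which is a choice rather than a requirement.

```python
# Invented records; key names are assumptions.
tweets = [
    {"city": "Stirling", "country": "UK"},
    {"city": "Stirling", "country": "UK"},
    {"city": "Oslo", "country": "Norway"},
    {"city": "Oslo", "country": None},
]
cities = [
    {"city": "Stirling", "country": "UK", "population": 1000},
    {"city": "Oslo", "country": "Norway", "population": 50000},
]

# A set keeps each (city, country) tuple once.
tweet_pairs = {(t["city"], t["country"]) for t in tweets if t["city"] and t["country"]}
city_pairs = {(c["city"], c["country"]) for c in cities}

# Up to 10 largest cities by population.
largest = sorted(cities, key=lambda c: c["population"], reverse=True)[:10]

print(len(tweet_pairs), len(city_pairs), tweet_pairs == city_pairs)
print([c["city"] for c in largest])
```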


In [None]:
#  Write your  code  and  comments  here  below 

# 5) Extract the data into Pandas Dataframes
Create __2__ Pandas Dataframes from the list of dictionaries and make sure you use appropriate data types for each column  
There are missing values in the data; make sure they are represented in the Dataframe as *NaN*  

Call the Dataframe variables __raw_tweet_data__ and __cities_data__

Include the ID of each person as a column and display the first 5 rows of the dataframe

*hint: each person's id is the id attribute of the corresponding xml tag*  
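
A small sketch of building a Dataframe from a list of dictionaries, assuming missing values arrive as `None` and that the column names below (all invented) resemble the real ones. `pd.to_numeric` with `errors="coerce"` turns anything non-numeric into NaN.

```python
import pandas as pd

# Invented records; None marks a missing value.
tweets = [
    {"id": "12345678-0001", "name": "Ann", "age": 34, "city": "Stirling"},
    {"id": "12345678-0002", "name": "Bob", "age": None, "city": None},
]

raw_tweet_data = pd.DataFrame(tweets)
# Force a numeric column; unparseable or missing ages become NaN.
raw_tweet_data["age"] = pd.to_numeric(raw_tweet_data["age"], errors="coerce")

print(raw_tweet_data.head())
print(raw_tweet_data.isna().sum())  # missing values per column
```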

In [None]:
#  Write your  code  and  comments  here  below 

# 6) Clean the twitter data
Find the missing values and replace, remove, and standardise as appropriate
- ID should be standardised to XXXXXXXX-XXXX (i.e. 8 and 4 digits separated by a dash), make the ID the index of the dataframe
- Records with missing tweets, age, name, city, or location should result in a removal of the data point
- Phone numbers should be strings containing only digits, with no other characters or whitespace; missing phone numbers should be replaced with the string '000'
- Missing country should be replaced with the country that corresponds to the city if possible.
 * That is, find another data point with the same city and copy the country. If there is no other record with the same city, remove the data point.
 * __This does not need to be completed programmatically__, that is, you can use one cell to supply the information and then another cell to fix it with direct assignment statements. Full marks will be given for this approach if the intentions are clear and the code is well formed and documented.

Store the resulting cleaned dataframe in __cleaned_tweet_data__  
__Print the total number of records and the number of rows with missing values before and after cleaning__
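
The ID and phone rules above could be expressed as two small helpers, sketched here under the assumption that a raw ID always contains its 12 digits in the right order with arbitrary separators around them. The function names and sample values are hypothetical. Each helper could be applied to a column with `Series.map`.

```python
import re

def standardise_id(raw):
    """Return the 12 digits in raw as XXXXXXXX-XXXX, or None if there are not exactly 12."""
    digits = re.sub(r"\D", "", str(raw))
    return f"{digits[:8]}-{digits[8:]}" if len(digits) == 12 else None

def clean_phone(raw):
    """Keep digits only; None, float NaN, or digit-free values become '000'."""
    if raw is None or raw != raw:  # raw != raw is True only for NaN
        return "000"
    digits = re.sub(r"\D", "", str(raw))
    return digits or "000"

print(standardise_id("12345678 0001"))
print(clean_phone("+44 1234 567-890"))
print(clean_phone(None))
```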

In [None]:
#  Write your  code  and  comments  here  below 

# 7) Validating
Use the data from __cities_data__ to check if there are any mismatches in the data.
- Are there latitudes and longitudes that don't match the City name?
- Are there any cities that are "located" in the wrong country?

Correct where possible, remove otherwise.  
Assume the cities data is accurate and use the "city" columns to match names

Correct the __cleaned_tweet_data__ in place and __print, display or comment on how many mismatches you found__

This does not have to be done programmatically in a single comprehensive search-and-fix routine. You can use 1 or more cells and "hard code" the search and the fixes with incremental steps.  
For example, the first cell prints out info of mismatches, what and where they are and the next cell uses that info to fix the mismatches. 
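
One possible way to surface country mismatches is a left merge against the reference table, sketched below with invented data. It assumes both frames share a `city` column, that city names are unique in `cities_data`, and that the tweet frame is indexed by ID; the same idea would extend to latitude and longitude columns.

```python
import pandas as pd

# Invented reference and tweet data; column names are assumptions.
cities_data = pd.DataFrame({"city": ["Stirling", "Oslo"], "country": ["UK", "Norway"]})
cleaned_tweet_data = pd.DataFrame(
    {"city": ["Stirling", "Oslo", "Atlantis"], "country": ["UK", "Sweden", "UK"]},
    index=["00000000-0001", "00000000-0002", "00000000-0003"],
)

# The left merge keeps every tweet row, in the original order.
checked = cleaned_tweet_data.merge(
    cities_data, on="city", how="left", suffixes=("", "_ref"), indicator=True
)
unknown_city = (checked["_merge"] == "left_only").to_numpy()
wrong_country = ~unknown_city & (checked["country"] != checked["country_ref"]).to_numpy()
print("unknown cities:", unknown_city.sum(), "wrong countries:", wrong_country.sum())

# Correct where a reference exists, remove where the city is unknown.
cleaned_tweet_data.loc[wrong_country, "country"] = (
    checked.loc[wrong_country, "country_ref"].to_numpy()
)
cleaned_tweet_data.drop(index=cleaned_tweet_data.index[unknown_city], inplace=True)
print(cleaned_tweet_data)
```

The masks are positional NumPy arrays, so they rely on the merge preserving row order, which holds for a left merge when the right-hand keys are unique.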

In [None]:
#  Write your  code  and  comments  here  below 

# 8) Grouping by country and city
Find out 
- the mean, median, and standard deviation of the age for each country.
- Answer the following:
 - What country has the most tweeters?
 - What __city__ has the most tweeters per capita?
 
The cities data contains information about population
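
The grouping questions map onto `groupby` and `value_counts`, as in the sketch below. The data and the column names (`country`, `city`, `age`, `population`) are invented, and "per capita" is taken to mean tweet count divided by city population.

```python
import pandas as pd

# Invented data; column names are assumptions.
cleaned_tweet_data = pd.DataFrame({
    "country": ["UK", "UK", "UK", "Norway"],
    "city": ["Stirling", "Stirling", "London", "Oslo"],
    "age": [20, 30, 40, 50],
})
cities_data = pd.DataFrame({
    "city": ["Stirling", "London", "Oslo"],
    "population": [1000, 100000, 50000],
})

# Mean, median and sample standard deviation of age per country.
age_stats = cleaned_tweet_data.groupby("country")["age"].agg(["mean", "median", "std"])

# Country with the most rows, i.e. the most tweeters.
top_country = cleaned_tweet_data["country"].value_counts().idxmax()

# Tweeters per inhabitant; the division aligns on the city name.
per_capita = (
    cleaned_tweet_data["city"].value_counts()
    / cities_data.set_index("city")["population"]
)
top_city = per_capita.idxmax()

print(age_stats)
print(top_country, top_city)
```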

In [None]:
#  Write your  code  and  comments  here  below 

# 9) Plot the age distribution by country
Create a figure that contains the distribution of age per country as box plots.  
For each box, visually show the mean and the confidence interval of the median in the figure, preferably using arguments of the plot function.
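
In Matplotlib, `boxplot` accepts `showmeans=True` to mark the mean and `notch=True` to draw notches for the confidence interval around the median. The sketch below uses invented ages and plain lists; with a Dataframe, `DataFrame.boxplot(column=..., by=...)` forwards the same keyword arguments.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt

# Invented ages per country.
ages_by_country = {
    "UK": [21, 25, 30, 34, 41, 47, 52],
    "Norway": [19, 28, 33, 38, 45, 60],
}

fig, ax = plt.subplots()
parts = ax.boxplot(
    list(ages_by_country.values()),
    notch=True,      # notches show the confidence interval of the median
    showmeans=True,  # adds a marker for the mean of each box
)
ax.set_xticks(range(1, len(ages_by_country) + 1))
ax.set_xticklabels(list(ages_by_country))
ax.set_xlabel("Country")
ax.set_ylabel("Age")
ax.set_title("Age distribution by country")
```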

In [None]:
#  Write your  code  and  comments  here  below 

# 10) Freestyle
Your answer will be judged by the clarity of the description, the creativity of the solution, and how realistic your suggested implementation is.

This is an opportunity to display what you have learnt in the module and how you believe it can be useful in practice.

### The task
Choose any dataset on Kaggle (https://www.kaggle.com/) and describe a simple data analysis you would want to do with that dataset.
- Provide the link to the dataset, and describe it __in your own words__ (very short, 3-10 sentences max)
- Justify your choice of the dataset (max 50 words)
- Document your process with a mix of appropriate comments and markdown boxes in between code boxes.
- Assume we have access to the dataset and that if we downloaded it into the __current working directory__ we would be able to do the same analysis on our machine
 - If possible, don't make any other assumptions about the folder structure on the machine the code would be run on. That is, make your solution as OS and folder agnostic as possible.
- Implement as much of your idea as possible, using solutions you have learnt in this module. 
 - Go through the process of *load->inspect->clean->explore->visualise->etc* and document each step
- For the ideas that require solutions or tools not covered in this module, provide at least a brief description and suggestions of what tools or solutions you would use.
- Be as clear and concise as you possibly can
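
For the OS and folder agnostic requirement, `pathlib` can resolve a file name against the current working directory without hard-coded separators. This is a minimal sketch; the helper name and `my_dataset.csv` are hypothetical.

```python
from pathlib import Path

def dataset_path(filename):
    """Resolve filename against the current working directory, whatever the OS."""
    return Path.cwd() / filename

csv_path = dataset_path("my_dataset.csv")  # hypothetical file name
print(csv_path.name, csv_path.exists())
```

The resulting path object can be passed directly to `pandas.read_csv`.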

In [None]:
# Provide your solution here and in any subsequent cell/box you deem necessary
# Use a mix of markdown and code cells.
#  Write your  code  and  comments  here  below 