## Introduction to Data Science

#### University of Redlands - DATA 101
#### Prof: Joanna Bieri [joanna_bieri@redlands.edu](mailto:joanna_bieri@redlands.edu)
#### [Class Website: data101.joannabieri.com](https://joannabieri.com/data101.html)

---------------------------------------
# Homework Day 2
---------------------------------------

GOALS:

1. Start using Python
2. Start looking at Data!

----------------------------------------------------------

This homework has **TWO** warm-up problems and **FOUR** problems.


## Hello World!

### Warm up Problem 1

Write Python code that will print your name, then run the cell.

In [2]:
print("Ayden Terning")


Ayden Terning


## Installing Modules

### Warm up Problem 2

Install the modules that you will need to run the rest of the code. You can copy and paste the commands from the class notes. This might take a minute or two to run... go get a cup of coffee :)

In [4]:
### This will take a while to run - just let it go.
!conda install -y numpy
!conda install -y pandas
!conda install -y matplotlib
!conda install -y plotly
!conda install -y itables
!conda install -y statsmodels
!conda install -y -c conda-forge python-kaleido


Retrieving notices: ...working... done
Channels:
 - conda-forge
 - defaults
Platform: osx-arm64
Collecting package metadata (repodata.json): done
Solving environment: done

# All requested packages already installed.

Channels:
 - conda-forge
 - defaults
Platform: osx-arm64
Collecting package metadata (repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /opt/anaconda3

  added / updated specs:
    - pandas


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    libcxx-18.1.8              |       h3ed4263_7         427 KB  conda-forge
    pandas-2.2.2               |  py312h8ae5369_1        13.8 MB  conda-forge
    ------------------------------------------------------------
                                           Total:        14.2 MB

The following packages will be UPDATED:

  libcxx                pkgs/main::libcxx-14.0.6-h848a8c0_0 --> conda-forge:
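
Once the installs finish, one way to confirm that everything is available is to ask Python whether it can locate each module. This is only a minimal sketch: the helper name `find_missing` is made up for illustration, and the list assumes that the conda package `python-kaleido` is imported under the name `kaleido`.

```python
import importlib.util

# Module names as they are imported in Python.
# Assumption: the conda package "python-kaleido" is imported as "kaleido".
required = ["numpy", "pandas", "matplotlib", "plotly",
            "itables", "statsmodels", "kaleido"]

def find_missing(module_names):
    """Return the names from module_names that Python cannot locate."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]

missing = find_missing(required)
if missing:
    print("Still missing:", ", ".join(missing))
else:
    print("All required modules were found.")
```

If a name shows up as missing, rerunning the matching `conda install` line is a reasonable first step.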

## Let's do some Data Science.

We will explore some data about how countries vote at the United Nations General Assembly. You should follow along with the notes and/or class video.

I am breaking this into parts and having you copy and paste the code so you can start to identify what the different parts of the code do.

-------------------------------
### Importing Packages:
-------------------------------

Every time we start a new project we will import the packages that will help us do the analysis. Copy and paste all of the imports in the cell below.

* numpy = numerical and mathematical tools for Python
* pandas = pretty tables, dataframes, and data analysis tools
* matplotlib.pyplot = nice looking static graphs
* plotly = nice looking interactive graphs
* itables = pretty looking tables that have a search bar

In [6]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots

from itables import show

-------------------------------
### Get the Data:
-------------------------------

We will practice all sorts of ways to get data into our notebooks. When you use pandas (pd) to read a CSV file:

    DF = pd.read_csv(file_location)

you are basically reading in a spreadsheet, like one you might open in Excel. The data for today is stored on my website.

It might take a minute for the data to load.

In [53]:
# Note this takes about a minute to run
file_location = 'https://joannabieri.com/introdatascience/data/unvotes.csv'
DF = pd.read_csv(file_location)
years = [int(d.split('-')[0]) for d in DF['date']]
DF['year'] = years
DF = DF.drop('Unnamed: 0',axis=1)
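
To see what the last three lines of that cell do, here is a small sketch that runs the same steps on a tiny table. The three rows are made up for illustration and only stand in for the real file. It assumes the dates are stored as text in `YYYY-MM-DD` form, which is what the `split('-')` step relies on.

```python
import io
import pandas as pd

# Hypothetical three-row stand-in for unvotes.csv (values are made up).
# The first header is blank, the way a saved row index usually appears.
csv_text = """,rcid,country,vote,date
0,1,Peru,yes,1946-01-01
1,2,India,no,1960-12-15
2,3,Peru,abstain,2001-06-30
"""

sample = pd.read_csv(io.StringIO(csv_text))

# Same step as the homework cell: the text before the first '-' is the year.
sample['year'] = [int(d.split('-')[0]) for d in sample['date']]

# pandas names the blank first header 'Unnamed: 0'; drop that column.
sample = sample.drop('Unnamed: 0', axis=1)

print(sample)
```

The new `year` column is what lets the later plots group votes by year.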





-------------------------------
### Initial Data Exploration:
-------------------------------
-------------------------------
### Show the Data:
-------------------------------

We want to look at what we have. The data is now represented by the variable DF.

In [27]:
show(DF)


rcid,country,country_code,vote,session,importantvote,date,unres,amend,para,short,descr,short_name,issue,year


### Problem 1

Do some initial exploration of this data and write about it in the markdown cell below.

* How many columns are there?
* What do the columns represent?
* What countries are there?
* Can you search for a country you are interested in?
* How many rows are there?
* What other observations do you have?



The first thing I looked for was the number of columns. There are 15 columns in this data set. The columns represent different points that the data set is highlighting. Using these different points, a data analyst is able to compare different countries and eras to each other. There are all sorts of countries, originating from all different parts of the globe. In the top right corner of the table you can search for a specific country. The table says there are 546 entries, so I estimate that there are 546 rows. Another observation is that the countries are organized alphabetically. What came to mind right off the bat was the similarity between this data set and programs like Microsoft Excel. One major similarity is the rows and cells that organize the data.
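
Row and column counts can also be read directly from a dataframe instead of from the table display. The sketch below uses a made-up three-row table named `sample`; swapping in `DF` would report the numbers for the homework data.

```python
import pandas as pd

# Hypothetical miniature table (made-up values), standing in for DF.
sample = pd.DataFrame({
    'country': ['Peru', 'India', 'Peru'],
    'vote': ['yes', 'no', 'yes'],
    'year': [1946, 1960, 2001],
})

# shape is a (rows, columns) pair.
n_rows, n_cols = sample.shape
print("rows:", n_rows, "columns:", n_cols)

# The column names, in order.
print(list(sample.columns))
```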

-------------------------------
### Make Python Explore the Data!:
-------------------------------

Copy and paste the command that will show you a list of the countries:

In [38]:
# Python can list all the different countries:
country_list = list(DF['country'].unique())

# Show the data in a nice way
show(pd.DataFrame(country_list,columns=['country']))


country


Copy and paste the command that will count up the number of countries:

In [36]:
# Python can count up the number of countries.
# Find the length of the list
print(len(country_list))

200
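
As a cross-check, pandas can also count distinct values in one step with `nunique()`. This is a sketch on a made-up column, not the homework data. Note that `nunique()` ignores missing values by default while `unique()` keeps them, so the two counts can differ when a column has gaps.

```python
import pandas as pd

# Hypothetical stand-in for DF['country'] (made-up rows).
sample = pd.DataFrame({'country': ['Peru', 'India', 'Peru', 'Turkey']})

# Two-step version from the homework: build the list, then measure it.
country_list = list(sample['country'].unique())
n_from_list = len(country_list)

# One-step version: ask pandas for the number of distinct values.
n_direct = sample['country'].nunique()

print(n_from_list, n_direct)
```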


### Problem 2

Try writing some code for yourself. Above we found a list of countries by focusing on the column named 'country'. See if you can redo that same code but change it to focus on the column named 'issue'.

What do I expect here:

* First copy and paste the code from above
* Then change that code slightly
* Run the cell to see if it worked

In [40]:
# Python can list all the different issues:
issue_list = list(DF['issue'].unique())

# Show the data in a nice way
show(pd.DataFrame(issue_list,columns=['issue']))


issue


-------------------------------
### Data Visualization:
-------------------------------

Now we will select three countries that we are interested in and see how their votes have changed over time. Below you should see code that selects: Turkey, United States, and United Kingdom.

**IMPORTANT** These have to be spelled and capitalized exactly like they are in the data. Python is unforgiving of typos!

You can just run the cell below - assuming you have done all the parts above!

In [68]:
countries = ['South Africa', 'Peru', 'India']
issues = list(DF['issue'].unique())
c_groups = DF.groupby(['country','issue'])
print(issues)

['Human rights', 'Economic development', 'Colonialism', 'Palestinian conflict', 'Arms control and disarmament', 'Nuclear weapons and nuclear material']
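
Because the country names have to match the data exactly, it can help to check the chosen names before plotting. The helper below is a hypothetical sketch (the name `names_not_in_data` is not part of the class notes) and runs on a made-up column. With the homework data you could pass `countries` and `DF['country']` instead.

```python
import pandas as pd

# Hypothetical stand-in for DF['country'] (made-up rows).
sample = pd.DataFrame({'country': ['Peru', 'India', 'South Africa']})

def names_not_in_data(requested, data_column):
    """Return the requested names that never appear in data_column."""
    known = set(data_column.unique())
    return [name for name in requested if name not in known]

# 'india' is lowercase on purpose, so it does not match 'India'.
problems = names_not_in_data(['Peru', 'india', 'South Africa'],
                             sample['country'])
print(problems)
```

An empty list means every name was found exactly as typed.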


Now that we have our focus countries, we can make pretty pictures.

You can just run the cell below - assuming you have done all the parts above!

In [73]:
def make_plot(countries,issue):
    '''
    A Python function that takes in a list of countries and a single issue and
    makes a scatter plot of the yearly % yes votes on that issue, with a
    trendline for each country. It relies on the c_groups variable defined above.
    '''
    x_data = []
    y_data = []
    c_data = []
    for cntry in countries:
        my_group = c_groups.get_group((cntry,issue))
        for y in my_group['year'].unique():
            x_data.append(y)
            tot_yes = sum(my_group[my_group['year']==y]['vote']=='yes')
            percent_yes = tot_yes/len(my_group[my_group['year']==y])*100
            y_data.append(percent_yes)
            c_data.append(cntry)

    fig = px.scatter(x=x_data, y=y_data,color=c_data,trendline="lowess",labels={"color": "Country"})

    fig.update_layout(
        title={
            'text': issue + '<br>',
            'y':0.9,
            'x':0.5,
            'xanchor': 'center',
            'yanchor': 'top'})
    fig.update_yaxes(title_text="% Yes")
    fig.update_xaxes(title_text="Year")
    f_name = issue
    fig.write_image(f_name+'.png')
    fig.show()
    
for iss in issues:
    make_plot(countries,iss)
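
The "% Yes" value on the y-axis is the share of a country's votes in a given year that were 'yes'. The sketch below computes the same quantity on a made-up table with a pandas `groupby` instead of the nested loops. It is an alternative way to get the numbers, not the method used in the class notes.

```python
import pandas as pd

# Hypothetical votes (made-up values), standing in for one issue in DF.
sample = pd.DataFrame({
    'country': ['Peru', 'Peru', 'Peru', 'India', 'India'],
    'year':    [2000, 2000, 2001, 2000, 2000],
    'vote':    ['yes', 'no', 'yes', 'yes', 'yes'],
})

# The mean of a True/False column is the fraction of True values,
# so multiplying by 100 gives the percent of 'yes' votes.
percent_yes = (
    sample.assign(is_yes=sample['vote'] == 'yes')
          .groupby(['country', 'year'])['is_yes']
          .mean()
          .mul(100)
          .reset_index(name='percent_yes')
)

print(percent_yes)
```

Each row of the result is one point on the scatter plot: one country, one year, and its percent of yes votes.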

### Problem 3

Choose one of the graphs produced above (should match the lecture notes) and discuss what you see there:

1. Which issue are you focusing on?
2. What do the graph axes represent?
3. Say in words how the United States votes have changed over time.
4. How do the United States votes compare to the other countries?
5. See if you can come up with one or two questions that interest you about the data in the graph.

The issue I am focusing on is economic development. The x-axis of the graph shows the year, while the y-axis shows the percentage of votes that were yes. When the data first began to be collected, the percentage of yes votes was very high for the United States, and then it dropped each decade until around the 2000s, when the United States began to vote yes more often. The United States followed a similar trend to the United Kingdom, while Turkey followed the opposite trend. However, there were a few differences between the United States and the United Kingdom. One was that the United Kingdom didn't start with nearly as high a percentage of yes votes. Another was that the United Kingdom didn't lose nearly as large a share of yes votes as the United States did before each country hit its respective inflection point. One question I would propose is: why did the percentage of yes votes for both the United States and the United Kingdom decrease and then increase at the same time? Another question I have is: why did Turkey have the opposite graph compared to the other two nations?

### Problem 4

Now go back up to where we picked our three focus countries and choose some different ones. Rerun the analysis. If you want you could choose more than three countries. 

Discuss the graph for the same issue but with the new countries.

For this section I picked South Africa, Peru, and India. What I noticed was that all three countries had similar trends: the graphs sloped upward for all three countries at the same time. On the contrary, once South Africa and India hit their highest points on the graph, both countries began to slope downward, while Peru stayed at almost a perfect slope of zero. A question I would raise is: what caused South Africa and India to both experience negative slopes? Answering this question would also help me identify why Peru maintained a neutral slope.

--------------------------------
### You are done with the homework... now what?
--------------------------------

1. Save your changes.
2. In the Git tab, **Stage** your changed files - use the (+) button next to the file.
3. **Commit** your changes by entering a summary and clicking Commit.
4. **Push** your changes using the cloud button.
5. Check that your changes are on your GitHub repo (online)
6. **TAKE THE DAILY QUIZ ON CANVAS** the quiz will ask you to copy and paste the HTTPS link to your repo for the day.
7. Come to class and get your questions answered.

At the end of the week you will submit your final versions of the homework on Canvas for grading. For this you can just drag and drop the HW.ipynb files into Canvas.

**IMPORTANT** If this is confusing this first week, don't worry, I can help you and I will be really flexible about these first few deadlines.
