<a href="https://colab.research.google.com/github/alt-bsmith/Github-and-Jupyter-setup/blob/main/setting_up_github_and_jupyter.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Intro to Human-Centered Data Science
## Setting up GitHub and JupyterHub
If you're reading this, you successfully forked the Git repo for this assignment from GitHub. Congratulations!   
  
Now that you've done that, you can see the first Jupyter Notebook for the class. To continue the assignment, let's do a few things with the notebook that we'll often see in data science tasks. But first, let's take a look at the Notebook itself. One of many great things about Jupyter Notebooks is that we can combine text, programming code, and visualizations in the same file.

### Cells
Jupyter Notebooks contain *cells* where you can write code and text, display visualizations, etc. You put content into a cell and then run the cell to get an output. Let's take a closer look at how to enter and run Python code in a cell.   
   
The cell below contains a simple Python statement that adds and prints two numbers. Move your mouse over the cell and click into it. You can run the cell in two ways:
1. When you hover over or click into a cell in Colab, you'll see a triangle icon, the play button, appear on the left side. Click that, and the cell will run.
2. Press Shift+Enter on your keyboard after clicking on the cell.  

In [1]:
## Our first line of Python code. Run this cell by clicking the run button
## or pressing Shift+Enter anywhere in the cell.
print(25+75)

100


You should see the number 100 if the cell ran successfully. This is just the beginning. We can write far more complex Python code in the cells, but they still run the same way: either click on the run icon or press Shift+Enter in the cell. Here's another example for you to run. By the way, don't worry about the actual Python code yet.

In [26]:
import string

# here's a very simple way to do sentiment analysis
def sentiment_analysis(text):
  # List of positive words
  positive_words = ["good", "great", "excellent", "love", "happy"]

  # List of negative words
  negative_words = ["bad", "terrible", "hate", "sad", "unhappy"]

  # remove punctuation
  text = text.translate(str.maketrans('', '', string.punctuation))

  # Make the text all lowercase. Then split into individual words
  words = text.lower().split()

  # count the number of positive and negative words
  positive_count = sum(1 for word in words if word in positive_words)
  negative_count = sum(1 for word in words if word in negative_words)

  # If more positive words than negative words, return 'positive'
  if positive_count > negative_count:
      return 'positive'
  # If more negative words than positive words, return 'negative'
  elif positive_count < negative_count:
      return 'negative'
  # else return 'neutral'
  else:
      return 'neutral'

# Test the function
print("'I had a great time!' is", sentiment_analysis("I had a great time!"))  # Output: positive
print("'I was unhappy!' is", sentiment_analysis("I was unhappy!"))  # Output: negative
print("'I feel blah today.' is", sentiment_analysis("I feel blah today."))  # Output: neutral


'I had a great time!' is positive
'I was unhappy!' is negative
'I feel blah today.' is neutral


That's a *very* simplified version of sentiment analysis, determining whether a sentence or phrase is positive, negative, or neutral. It's a useful technique in text analysis, and you'll certainly see more realistic implementations of it later in the program.
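
To see why the function strips punctuation before counting words, here is a minimal sketch of that one step in isolation. The sample sentence and variable names are made up for illustration; only `string.punctuation`, `str.maketrans`, and `translate` come from the code above.

```python
import string

sample = "This is great!"  # hypothetical example sentence

# Without stripping punctuation, the last word keeps its exclamation mark
raw_words = sample.lower().split()

# Build a translation table that deletes every punctuation character
table = str.maketrans('', '', string.punctuation)
clean_words = sample.translate(table).lower().split()

print(raw_words)               # ['this', 'is', 'great!']
print(clean_words)             # ['this', 'is', 'great']
print("great" in raw_words)    # False: 'great!' is not the same string as 'great'
print("great" in clean_words)  # True
```

Without that step, a word like `great!` would never match the entry `great` in the list of positive words.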

### Markdown
The text you're reading now is in a format called [Markdown](https://www.markdownguide.org). Markdown is a simple language that lets you add formatting to plain text documents. You're used to formatting text by clicking on words or phrases and then selecting a format, e.g., bold, italic, Header, etc. In Markdown, you don't apply formats this way. Instead, you add special codes to the text to specify the way it should appear.

For example, the first line in this document is a heading. To make it appear large, we add a number sign before it (e.g., `# Intro to Human-Centered Data Science`). You can make text _italicized_ by adding an asterisk or underscore before and after it (e.g.,  `_Jupyter Notebooks are great!_`). Want text in **bold**? Add two asterisks before and after the text (e.g., `**my bold text**`).  

A Jupyter Notebook cell can be set to edit and display text with Markdown formats. It's a useful way to add documentation to your project with all the benefits of formatted text. [Here's](https://www.markdownguide.org/cheat-sheet/) a useful cheatsheet that describes all the Markdown syntax. Let's practice using some of the codes you'll use often.    


Using the [cheat sheet](https://www.markdownguide.org/cheat-sheet/) as a guide, write a short introduction of yourself below, using the following Markdown formatting:
1. Include a *heading* with your name.
2. Tell us where you did your undergraduate degree. Format the university in **bold** and *italicize* your major.
3. Tell us why you decided to study data science. Highlight your response using a *blockquote*.
4. Name three of your hobbies using a *numbered* list.
5. Provide the name of a web site that you often visit and include a *hyperlink* to that site.

### ENTER YOUR TEXT WITH MARKDOWN BELOW  
To edit a Markdown cell, you need to double-click it.  

When you're done entering your text, press Shift+Enter to run the cell and see the formatted text!


### Importing Python libraries
We learned earlier that the Python programming language has many libraries that provide useful tools for free. We use a Python command called `import` to include a library in a Notebook. For example, the statement `import math` will make Python's math library available in our Notebook. The library lives in what is called a _module_. Think of it as a container filled with lots of math functions and constants. We use what's called _dot notation_ to access the functions inside the module. Look at the code below for an example:


In [27]:
import math
print(math)

<module 'math' (built-in)>


We used `import` to bring the math library into our Notebook. When we print `math`, we see its type identifier. Don't worry too much about this except to note that it is a module. Now let's access the constant `pi` within the module.

In [28]:
print(math.pi)

3.141592653589793


There's the dot notation. We write `math.pi` to access the constant pi, which is approximately 3.14159. What happens if we try to ask for pi _without_ the math module?

In [29]:
print(pi)

NameError: name 'pi' is not defined

Look at all of that text above: That's Python's way of signaling an error. As you can guess, 'pi' is undefined. But `math.pi` *is* defined and ours to use once we import math. In fact, we can call lots of math functions using dot notation:

In [34]:
print("The factorial of 5 is =", math.factorial(5))
print("90 degrees in radians =", math.radians(90))
print("The sine of 90 degrees =", math.sin(math.radians(90)))
print("We can represent non-numbers with", math.nan)

The factorial of 5 is = 120
90 degrees in radians = 1.5707963267948966
The sine of 90 degrees = 1.0
We can represent non-numbers with nan


Don't worry too much about the actual math. The important thing to note here is that we import libraries as modules, and we access things inside of them with `module_name.function_or_constant_name` (e.g., `math.inf` is a constant for positive infinity).  
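
As a small sketch of that pattern, the cell below reaches one function and one constant inside `math` with dot notation. The use of `dir()` to list a module's contents is an addition here, not something the examples above rely on.

```python
import math

# A function inside the module: square root
print(math.sqrt(16))       # 4.0

# A constant inside the module: positive infinity is larger than any number
print(math.inf > 10**100)  # True

# dir() lists the names a module contains; both names used above are in it
names = dir(math)
print("sqrt" in names, "inf" in names)  # True True
```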

Sometimes we only want to import a few functions or constants from a library. As an example, imagine we only need the constant `pi` and the function `pow` from the math library. We can do the following to just get those two:

In [37]:
from math import pi, pow
print(pi)
print("2 to the 5th power =", pow(2,5))

3.141592653589793
2 to the 5th power = 32.0
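
One thing to be aware of with this style: Python already has a built-in function named `pow`, and `from math import pow` puts the math version under that same name in the Notebook. That is why the result above is `32.0` rather than `32`. The sketch below shows the difference; the alias `math_pow` is a name chosen for this example so that both versions stay available.

```python
import builtins
from math import pow as math_pow  # alias chosen for this example

# The built-in pow keeps integers as integers
print(builtins.pow(2, 5))  # 32

# math.pow converts its arguments to floats and returns a float
print(math_pow(2, 5))      # 32.0
```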


We can also import a library and give it an alias. For example, we will use a library named `pandas` **a lot** in this and other courses. You often see the pandas module given the nickname or alias `pd`. Here's how to do that:

In [38]:
import pandas as pd
print(pd)
my_series = pd.Series([1, 3, 5, 7, 9])
print(my_series)

<module 'pandas' from '/usr/local/lib/python3.10/dist-packages/pandas/__init__.py'>
0    1
1    3
2    5
3    7
4    9
dtype: int64


You can see that we import pandas *as* `pd`. Then we can use dot notation on pd to use functions in the pandas library. We printed pd to make sure it's a module. Then we called a function named `Series` to make a one-dimensional array of numbers. Think of a `Series` as a single column or row in a data table.
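
Dot notation also works on the `Series` itself. Here is a minimal sketch, assuming the same five numbers as above; `sum` and `mean` are standard pandas `Series` methods, chosen here only as examples.

```python
import pandas as pd

my_series = pd.Series([1, 3, 5, 7, 9])

# Methods on the Series are reached with dot notation too
print(my_series.sum())   # 25
print(my_series.mean())  # 5.0

# Square brackets look up a value by its index label (the left column in the printout)
print(my_series[2])      # 5
```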

### Loading data from a file into a `pandas` `DataFrame`

We'll often have data stored in a file or database that we want to load into a Notebook. pandas has a set of [input/output functions](https://pandas.pydata.org/docs/user_guide/io.html) that let us load and store data from HTML, comma-separated values (CSV), Excel, SQL, and other files. Let's try a simple example with a CSV file.
  
You should have a file named 'student-scores.csv' in the repository that you forked from GitHub. Let's load that file using the pandas function `read_csv`.

In [None]:
df = pd.read_csv("student-scores.csv")
df.head()

Here's what we just did:
1. We called `pd.read_csv` with the name of our data file, student-scores.csv.
2. `read_csv` loads the file and converts it to a pandas `DataFrame`. A `DataFrame` is like an Excel sheet, a bunch of data elements in rows with different attributes or features in the columns. We assign the new `DataFrame` to a variable called `df`.
3. Now that `df` is defined as a `DataFrame`, we can use dot notation to run `DataFrame` functions on it. One of those is called `head`, which returns the first five rows of the data table.

You see that the columns of the table are named things like `first_name`, `last_name`, `gender`, etc. This is a table of student records, so you'll also see things like `absence_days`, `math_score`, `physics_score`, etc. You can imagine a scenario where these data might be used to understand patterns or relations between variables (e.g., are there correlations between math and physics scores?). We'll do more of that kind of analysis in the next module.  
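
If you want to experiment with `read_csv` and `head` without touching the course file, the sketch below reads CSV text from memory instead of from disk. The three columns and six rows are invented for illustration and are not the contents of student-scores.csv; `io.StringIO` simply makes a string behave like a file.

```python
import io
import pandas as pd

# Made-up CSV text: a header line followed by six data rows
csv_text = """first_name,math_score,physics_score
Ana,90,85
Ben,72,64
Chen,88,91
Dee,65,70
Eli,79,83
Fay,94,89
"""

toy_df = pd.read_csv(io.StringIO(csv_text))

print(toy_df.shape)          # (6, 3): six rows, three columns
print(len(toy_df.head()))    # 5: head() returns the first five rows by default
print(len(toy_df.head(2)))   # 2: pass a number to ask for fewer or more rows
print(list(toy_df.columns))  # ['first_name', 'math_score', 'physics_score']
```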
  
You also have an Excel version of student-scores. You should be able to find a pandas command that works like `read_csv` but loads an Excel file instead. [This list](https://pandas.pydata.org/docs/user_guide/io.html) of input/output functions will help you.


In [None]:
# Create a variable named excel_df. Then use a pandas function to
# read the Excel data file named 'student-scores.xlsx'. Assign
# the DataFrame to the excel_df variable.

### CREATE excel_df BELOW AND ASSIGN IT WHAT YOU READ FROM 'student-scores.xlsx'.


### KEEP THE LINE BELOW SO YOU CAN SEE IF YOU HAVE THE DATA IN df_excel
excel_df.head()

You should now have two DataFrames, `df` and `excel_df`, that contain the *same* data. Let's make sure that's true using the `DataFrame` method `equals()` below:

In [None]:
df.equals(excel_df)
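
A note on what `equals` checks: it returns `True` only when both DataFrames have the same shape, the same values, and the same data type in each column. The toy DataFrames below are invented for illustration (they are not the course data) and show how a difference in data type alone gives `False`, which is worth knowing if the result of a comparison surprises you.

```python
import pandas as pd

# Three tiny, made-up DataFrames
a = pd.DataFrame({"score": [90, 72, 88]})
b = pd.DataFrame({"score": [90, 72, 88]})
c = pd.DataFrame({"score": [90.0, 72.0, 88.0]})  # same numbers stored as floats

print(a.equals(b))  # True: same values and same data type
print(a.equals(c))  # False: integers in one, floats in the other
print(a.dtypes)     # dtypes shows the type stored in each column
```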