# Data Science Basics - Day 1

Data Science is the process of extracting meaningful information from data by combining programming skills, domain knowledge, and statistics.

#### The Four Steps of any Data Science Project
1. **Data Collection** (doesn't always happen through programming, but sometimes does - webscraping, text extraction, image processing, simulations, etc.)
2. **Data Cleaning**
3. **Data Analysis**
4. **Data Visualization**

#### <br>Which Python tools might you use? (an incomplete list)
1. **Data Collection**
- Web scraping: requests, BeautifulSoup
- Text Extraction: pdfminer, pytesseract, many others for reading and translating pdfs
- Image Processing: Fiji, ImageJ
- Simulations: domain-specific packages, threading, multiprocessing, scipy, numpy, etc.
2. **Data Cleaning**
- Datasets: pandas, numpy
- Images: Fiji, ImageJ, pandas
- Text: NLTK, textblob, pandas, others
- Big Data: pySpark
3. **Data Analysis**
- Statistical Models: scipy, statsmodels, pandas, numpy
- Machine Learning/Predictive Modeling/Classification: Scikit-learn
- Deep Learning, Image Classification: TensorFlow, Keras, pyTorch
- Text: NLTK, others, plus machine learning tools
- Big Data: pySpark
4. **Data Visualization**
- Static: pandas, matplotlib, seaborn, plotly, geopandas
- Interactive: bokeh, plotly, flask


#### <br>Why do we work with Jupyter Notebooks for data science?

Jupyter Notebooks allow us to view nicely formatted output (such as pandas DataFrames and data visualizations) directly below the code used to create the object. They also allow you to scroll through large DataFrames or images.

<br><br>Most of this week is going to focus on the Python package Pandas.
However, Pandas, like many other Python packages, is built on NumPy arrays, so we're going to start there. Some of you will also want to use NumPy to work with large numerical datasets.

## <br><br>An introduction to NumPy arrays

NumPy is a Python package that allows you to do **fast** operations on numerical data, including "mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, random simulation and much more."

NumPy comes with the Anaconda distribution of Python and is available on Google Colab.

<br>We import NumPy under the short alias `np`, which is the nickname almost everyone uses for it.

In [1]:
import numpy as np

<br>The heart of NumPy is the **array** object.

<br>NumPy arrays are used behind-the-scenes of many Python packages, including:
- pandas, GeoPandas
- matplotlib, plotly, seaborn, bokeh
- SciPy, scikit-learn, statsmodels
- TensorFlow, PyTorch
- Jupyter
- Biopython
- many, many more

<br>Some packages will output an array, even if you gave the function a list or other object. ***This is one reason why you should learn to recognize and index an array, even if you don't think it is an object you need to create in your work.***
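As a small illustration of this point, the sketch below passes a plain Python list to a NumPy function and checks what comes back. `np.sqrt` is used here only as a convenient example, and the variable names are made up for the demonstration.

```python
import numpy as np

values = [1, 4, 9]         # a plain Python list
roots = np.sqrt(values)    # NumPy converts the list and returns an array

print(type(values))
print(type(roots))
print(roots)
```

Even though we never created an array ourselves, we now have one to work with.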

<br>**Topics for today:**
- What is a numpy array?
- recognizing arrays
- indexing arrays
- looping through arrays
- **Where to learn more:** https://numpy.org/devdocs/user/quickstart.html

### <br><br>What is a numpy array and how do we recognize it?

The array is a *multidimensional* object.
<br><br>It can have up to *n* dimensions.
<br><br>There are lots of applications for multidimensional objects in scientific research (see https://numpy.org/ for a few case studies), but let's think about just one case. Let's say you have a series of lat/long points for the state of Illinois - you would have 2-dimensional data. For each lat/long, you also have air quality data - layers of several different air pollutants at each point - you now have 3-dimensional data. You also have a full set of these measurements for every hour of the day - a 4th dimension.

<br>An array looks like a list of lists, but on its own a list of lists is not multidimensional. It is simply a collection of multiple one-dimensional objects. We can use the function `np.array()` to turn our list of lists into an array:

In [None]:
x = np.array([[1, 2, 3], [4, 5, 6]])

In [None]:
print(x)

In [None]:
x

<br>You can see that an array looks different than a list when you print it and when you return it.

In [None]:
list_x = [[1, 2, 3], [4, 5, 6]]

In [None]:
print(list_x)

In [None]:
list_x

Take a few seconds to look at the difference between the two objects above, so that you will recognize an array when you see it.

<br><br><br>Each number in the array is called an **element**.
<br><br>***Unlike a list, all of the elements in an array must be the same data type.***
<br><br>Let's run the same code only we'll make the last element a float instead of an integer:

In [None]:
x = np.array([[1, 2, 3], [4, 5, .6]])

In [None]:
print(x)

<br>Notice that it changed all of the elements to floats and added a decimal point after each integer.
<br><br>We can check the data type of the array with the `dtype` attribute:

In [None]:
x.dtype
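If you want to see this conversion on a few more cases, here is a minimal sketch. The three small arrays are made up for illustration. The exact integer type can differ between operating systems, so no expected output is shown.

```python
import numpy as np

ints = np.array([1, 2, 3])          # all integers
mixed = np.array([1, 2, 0.6])       # one float turns every element into a float
text = np.array([1, 2, "three"])    # one string turns every element into a string

print(ints.dtype)
print(mixed.dtype)
print(text.dtype)
print(text)
```

NumPy picks a single data type that can hold every element, so a single float or string changes the type of the whole array.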

<br>Some objects have **attributes**. These tell you something *about* an object and don't do anything to or with the object. They follow the object name after a dot and do not take parentheses. If you add parentheses, Python tries to call the attribute like a function and raises an error:

In [None]:
x.dtype()
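The cell above stops with an error. If you would rather see the error message without the notebook halting, one option is to wrap the call in `try`/`except`, as in this sketch. The variable `error_raised` is just a made-up flag for the demonstration.

```python
import numpy as np

x = np.array([[1, 2, 3], [4, 5, 0.6]])

print(x.dtype)          # attribute: no parentheses

error_raised = False
try:
    x.dtype()           # calling the attribute like a function fails
except TypeError as err:
    error_raised = True
    print("TypeError:", err)
```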

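<br><br>***Another difference between a list and an array is that an array has a set size when it is created.*** You cannot add elements to an existing array - arrays have no `append()` method. (NumPy does have a separate function, `np.append()`, but it builds and returns a new array rather than growing the original.)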
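Here is a short sketch of what the fixed size means in practice. Arrays have no `.append()` method of their own. NumPy's `np.append()` function can be used instead, but it copies the data into a new array and leaves the original untouched. The names `a` and `b` are only for illustration.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.append(a, 4)            # builds and returns a new, longer array

print(hasattr(a, "append"))    # does the array have an .append() method?
print(a)                       # the original array is unchanged
print(b)
```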

<br><br><br>Each dimension of an array is called an **axis**.
<br><br>`x` is a 2-dimensional array with 2 axes:

In [None]:
print(x)

<br>The first axis has a length of **2**.
<br>The second axis has a length of **3**.

<br><br>Let's make an array with 3 axes. We use additional sets of square brackets to group our dimensions:

In [None]:
y = np.array([[[10, 20, 30, 40], [11, 21, 31, 41], [12, 22, 32, 42]], 
              [[50, 60, 70, 80], [51, 61, 71, 81], [52, 62, 72, 82]]])

In [None]:
print(y)

<br>The first axis has a length of **2**.
<br>The second axis has a length of **3**.
<br>The third axis has a length of **4**.

#### <br><br>Array attributes

**Try out these attributes, which are especially handy with large arrays:**

How many dimensions (axes) are in your array?

In [None]:
y.ndim

What are the lengths of each axis?

In [None]:
y.shape

How many total elements are in the array?

In [None]:
y.size

The `size` is equal to the product of the lengths of all the axes in the array.
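To tie these attributes back to the air quality example from earlier, here is a sketch of a 4-dimensional array. The axis lengths (24 hours, 3 pollutants, a 5 x 4 grid of points) are hypothetical, and `np.zeros()` is used only to create an array of the right shape filled with zeros.

```python
import math
import numpy as np

# Hypothetical sizes, chosen only for illustration
n_hours, n_pollutants, n_lat, n_lon = 24, 3, 5, 4

air = np.zeros((n_hours, n_pollutants, n_lat, n_lon))

print(air.ndim)     # number of axes
print(air.shape)    # length of each axis
print(air.size)     # total number of elements
print(math.prod(air.shape) == air.size)   # size is the product of the axis lengths
```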

### <br><br>Exercise 1

In [2]:
my_array = np.array([["cat", "dog", "dog", "dog", "cat", "cat"], 
                     ["small", "large", "small", "medium", "small", "small"]])

Run the cell above to store the array. Write code to return the length of all the axes in `my_array`:

In [3]:
my_array.shape

(2, 6)

Write code to find out the data type of the data contained in `my_array`:

In [4]:
my_array.dtype

dtype('<U6')

### <br><br>Indexing arrays

Arrays are indexed in a similar way to other Python objects. You can index individual points or a range of points on each axis. If you want all points in an axis, use `:`.

Let's take another look at the array `y`:

In [None]:
print(y)

#### <br>For each indexed array below, try to guess what will be returned before you run the code:

In [None]:
y[0, 0, 0]

In [None]:
y[0, 0]

In [None]:
y[0]

In [None]:
y[:, 0, 0]

Notice that if you index multiple elements in an array, the answer is returned as an array.

In [None]:
y[-1, 1, 2]

In [None]:
y[:, :, -1]

In [None]:
y[0, 0:2, 0:2]
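If you want to check your guesses, it can help to look at the type and shape of what comes back. This sketch uses a smaller, made-up array `z` so it doesn't give away the answers above.

```python
import numpy as np

z = np.array([[1, 2, 3], [4, 5, 6]])

single = z[0, 1]     # one index per axis: a single element
row = z[0]           # fewer indices than axes: a 1-dimensional array
block = z[:, 0:2]    # ranges on an axis: a 2-dimensional array

print(type(single), single)
print(type(row), row.shape)
print(type(block), block.shape)
```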

### <br><br>Exercise 2

In [5]:
my_array = np.array([[.1, .2, .3, .4, .5, .6], [.01, .02, .03, .04, .05, .06]])

Run the line of code above to store the array. Write code to index the element .4:

In [6]:
my_array[0, 3]

0.4

Write code to index the elements .04, .05, and .06:

In [7]:
my_array[1, 3:]

array([0.04, 0.05, 0.06])

### <br><br>Looping through arrays

When you loop through an array, the default is to loop through the first axis.

Another reminder of array `y`:

In [None]:
print(y)

In [None]:
for i in y:
    print("A loop:")
    print(i)

<br>To loop through multiple levels, you have to write loops within loops:

In [None]:
for i in y:
    print("AN OUTER LOOP:")
    for j in i:
        print("An inner loop:")
        print(j)

<br>You can also index the part of the array that you want to loop through:

In [None]:
for i in y[1, 2]:
    print(i)

<br>To loop through every element in the array, you can use the attribute `.flat`:

In [None]:
for i in y.flat:
    print(i)

<br>The `.flat` attribute will also allow you to make a list out of all the elements in an array, if that is something you ever need to do:

In [None]:
list(y.flat)
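Sometimes you also want to know *where* each element sits while you loop. One option, which is not covered in the rest of this notebook, is NumPy's `np.ndenumerate()`, which gives you the index and the value together. The array `small` is made up for this sketch.

```python
import numpy as np

small = np.array([[10, 20], [30, 40]])

# Each step gives a tuple of indices (one per axis) and the element stored there
for index, value in np.ndenumerate(small):
    print(index, value)
```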

### <br><br>Exercise 3

This sample array has 2 axes. The first axis has a length of three - latitude, longitude, and an air quality index score.

In [8]:
air_quality = np.array([[41.8781, 42.0451, 41.8850, 41.7606, 42.0324, 41.5250, 42.3636, 42.0884], 
                        [87.6298, 87.6877, 87.7845, 88.3201, 87.7416, 88.0817, 87.8448, 87.9806], 
                        [59, 80, 101, 92, 120, 153, 94, 110]])

Write code to loop through the array and print each air quality index score that is 101 or higher:

In [9]:
for i in air_quality[-1]:
    if i >= 101:
        print(i)

101.0
120.0
153.0
110.0


<br><br>If you think you will be using numpy arrays in your research (for example, if you work with really large numerical datasets), I recommend starting with the NumPy quickstart guide: https://numpy.org/devdocs/user/quickstart.html. The guide will review some of today's topics, but it also covers using basic arithmetic operators and functions with arrays, as well as splitting and joining arrays.
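As a small taste of what the quickstart guide covers, here is a sketch of arithmetic on a whole array at once, with no loop. It reuses the air quality scores from Exercise 3. The last line uses a comparison inside the square brackets to select elements, which is a technique from the guide rather than something we practiced today.

```python
import numpy as np

scores = np.array([59, 80, 101, 92, 120, 153, 94, 110])

print(scores + 10)             # adds 10 to every element
print(scores / 2)              # divides every element by 2
print(scores[scores >= 101])   # keeps only the elements that pass the test
```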

## <br><br>pandas

- Pandas is one of the most commonly used Python packages/libraries for data science.<br><br>
- Pandas is Python's answer for making two-dimensional tables (à la Excel and SQL).<br><br>
- Pandas calls a table a "DataFrame".<br><br>
- Pandas DataFrames are used by Python's other packages for statistical analysis, data manipulation, and data visualization.<br><br>
- Pandas DataFrames can be exported as .csv and other files.<br><br>

I've never met someone who loves pandas. The module, not the animal. The syntax isn't very instinctual. Some of the syntax will differ from basic Python. I still have to look a lot of things up in pandas, if it's something I don't do very often. However, it is the tool for working with spreadsheets in Python, so you'll need to learn it at some point.

### <br>import pandas

Because pandas is one of the most commonly used Python packages, it often gets imported as a shortened version of its actual name. This makes it quicker to type.

In [10]:
import pandas as pd

Pandas comes with the Anaconda distribution of Python and is available on Google Colab.

<br><br>Today we are only going to focus on opening files in pandas and looking at the DataFrames. We will be working in pandas for the rest of the week.

#### If you are using Google Colab, you must run the next line of code. *If you are NOT using Google Colab, do NOT run the next line.*

In [None]:
!wget https://raw.githubusercontent.com/aGitHasNoName/dataScienceBasics/main/forestfires.csv
!wget https://raw.githubusercontent.com/aGitHasNoName/dataScienceBasics/main/pigeonRacing.txt
!wget https://raw.githubusercontent.com/aGitHasNoName/dataScienceBasics/main/zoo.xlsx

### <br><br>about the practice data

Tomorrow, we will be working with a dataset from forest fires in NE Portugal. I have included the dataset as a csv file in today's materials, but the data is available publicly at this site: https://archive.ics.uci.edu/ml/datasets/Forest+Fires

### <br>loading a csv file

We will use the function `pd.read_csv()`. This will automatically create a **DataFrame** object, which we are saving as `df`. `df` is a common variable name for a DataFrame. You can open the file, define it as a Pandas DataFrame, assign it to a variable, and close the file in one line.

In [None]:
df = pd.read_csv("forestfires.csv")
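If you don't have the file handy and just want to see what `pd.read_csv()` does, here is a self-contained sketch. Instead of a file on disk, it reads a few lines of made-up CSV text through `io.StringIO`. The column names and values are invented for illustration and are not the real forest fires data.

```python
import io
import pandas as pd

# Made-up CSV text standing in for a file on disk
csv_text = "month,temp,area\naug,22.5,0.0\nsep,18.1,4.6\n"

demo_df = pd.read_csv(io.StringIO(csv_text))

print(type(demo_df))
print(demo_df)
```

The first line of the text becomes the column names, and each following line becomes one row.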

### <br><br>viewing the DataFrame

In [None]:
df

<br>Take a minute to look at the data. The DataFrame will have a slightly different look on Colab and Jupyter, and on different versions of Jupyter.
<br><br>The number at the beginning of each row is called an **index**. The index was automatically assigned by pandas when the dataset was loaded. It was not in the original csv file. It is merely a series of consecutive numbers going down the rows. The rows were loaded in whatever order they were in the csv file.
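To look at the index on its own, you can use the `index` attribute. The sketch below builds a tiny made-up DataFrame directly with `pd.DataFrame()` (rather than loading a file) so that it runs anywhere.

```python
import pandas as pd

demo_df = pd.DataFrame({"animal": ["bear", "wolf", "bass"], "legs": [4, 4, 0]})

print(demo_df.index)           # the index pandas created automatically
print(list(demo_df.index))     # consecutive numbers, starting at 0
print(list(demo_df.columns))   # the index is not one of the columns
```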

<br><br>There are ways to view pieces of the DataFrame. Try these to see what they do:

In [None]:
df.head()

In [None]:
df.head(10)

In [None]:
df.tail()

In [None]:
df.tail(2)

In [None]:
df.sample()

In [None]:
df.sample(6)
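You probably noticed that `sample()` returns different rows every time you run it. If you need the same random rows each time (for example, so a collaborator sees what you see), you can pass the `random_state` argument. This sketch uses a small made-up DataFrame, and the seed value 42 is arbitrary.

```python
import pandas as pd

demo_df = pd.DataFrame({"value": range(10)})

first = demo_df.sample(3, random_state=42)
second = demo_df.sample(3, random_state=42)

print(first)
print(first.equals(second))   # same seed, same rows
```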

### <br><br>loading other types of files

We can open a tab-separated file using the same function we used to open a csv. We just have to pass a second argument, a **keyword argument**, to tell it that the delimiter is a tab instead of the default (comma). This dataset contains rankings of professional racing pigeons.

In [None]:
pigeon_df = pd.read_csv("pigeonRacing.txt", delimiter="\t")

In [None]:
pigeon_df.head()
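To see why the `delimiter` argument matters, this sketch reads the same tab-separated text twice: once with the default comma and once with a tab. The text is made up for illustration and is not the real pigeon racing data.

```python
import io
import pandas as pd

# Made-up tab-separated text standing in for a file on disk
tsv_text = "rank\tpigeon\tspeed\n1\tBlue\t172.1\n2\tGrey\t171.8\n"

comma_guess = pd.read_csv(io.StringIO(tsv_text))               # default delimiter: comma
tab_df = pd.read_csv(io.StringIO(tsv_text), delimiter="\t")    # delimiter: tab

print(comma_guess.shape)
print(tab_df.shape)
```

With the wrong delimiter, pandas can't find any column breaks, so everything lands in a single column.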

<br><br>We will use a different function to open an Excel file. This file has information about animals and contains two sheets. We will load the first sheet and then the second. We have to pass the `read_excel()` function one extra argument, `sheet_name`, to specify the sheet. Sheets are counted from 0, so the first sheet is `sheet_name=0`:

In [12]:
zoo_df = pd.read_excel("zoo.xlsx", sheet_name=0)

In [None]:
zoo_df.head()

In [None]:
zoo_class_df = pd.read_excel("zoo.xlsx", sheet_name=1)

In [None]:
zoo_class_df.head()

### <br><br>Exercise 4

Try to load two or three files from your own computer into pandas. Try with at least two different file types (csv, tab-delimited, excel).

<br>**If you are using Google Colab**, you will need to upload the files to Colab yourself. You can do this by clicking on the folder on the left menu. You should see a file tree come up that includes sample_data. Right click anywhere in this space and choose upload to upload your own files.

### <br><br>getting basic info about the DataFrame

You can use the `len()` function to find out how many rows are in a DataFrame object:

In [None]:
len(df)

<br>The `describe()` method will give you some very basic stats about each numeric column in your DataFrame:

In [None]:
df.describe()
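By default, `describe()` only summarizes the numeric columns. If you also want a summary of text columns, you can pass `include="all"`. Here is a sketch with a small made-up DataFrame:

```python
import pandas as pd

demo_df = pd.DataFrame({"animal": ["bear", "wolf", "bass"], "legs": [4, 4, 0]})

numeric_summary = demo_df.describe()             # numeric columns only
full_summary = demo_df.describe(include="all")   # text columns as well

print(numeric_summary)
print(full_summary)
```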

<br>The `shape` attribute will return the number of rows and columns as a tuple:

In [None]:
df.shape

You can even save the shape tuple as an object, in case you need to include it in any code:

In [None]:
df_shape = df.shape

In [None]:
print("Our DataFrame has " + str(df_shape[0]) + " rows and " + str(df_shape[1]) + " columns.")
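Another way to write the same message is to unpack the tuple into two variables and use an f-string, which saves you from converting the numbers with `str()`. This is only a sketch. The DataFrame and the names `n_rows` and `n_cols` are made up.

```python
import pandas as pd

demo_df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

n_rows, n_cols = demo_df.shape   # unpack the tuple into two names
message = f"Our DataFrame has {n_rows} rows and {n_cols} columns."
print(message)
```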

<br>The `size` attribute will tell you the total number of elements in the DataFrame (size = rows x columns):

In [None]:
df.size

<br>To return a list of the column names, you can start with the `columns` attribute:

In [None]:
df.columns

Hmm. That looks strange because it is a pandas object. You can make it into a list so that it is easier to work with:

In [None]:
column_names = list(df.columns)
print(column_names)

<br>To find out the data types of the data found in each column, use the `dtypes` attribute:

In [None]:
df.dtypes

<br>To **transpose** a DataFrame (swap the rows and columns), you also use an attribute:

In [None]:
df.T

<br>Let's see if that changed our DataFrame object:

In [None]:
df

<br><br>It didn't change! `df.T` does not modify the DataFrame in place. It returns a new, transposed object and leaves `df` as it was. To save the transposed DataFrame, we have to assign it to a variable:

In [None]:
df_t = df.T
df_t
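You can confirm this by comparing shapes. In this sketch, with a small made-up DataFrame, the original keeps its shape while the object returned by `.T` has the rows and columns swapped.

```python
import pandas as pd

demo_df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

demo_t = demo_df.T     # .T returns a new object

print(demo_df.shape)   # the original is unchanged
print(demo_t.shape)    # rows and columns are swapped
```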

### <br><br>Exercise 5

First run the following code cell to look at the zoo animals DataFrame:

In [13]:
zoo_df

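| | animal | hair | feathers | eggs | milk | airbourne | aquatic | predator | toothed | backbone | breathes | venomous | fins | legs | tail | domestic | catsize | type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | aardvark | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 4 | 0 | 0 | 1 | 1 |
| 1 | antelope | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 4 | 1 | 0 | 1 | 1 |
| 2 | bass | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 4 |
| 3 | bear | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 4 | 0 | 0 | 1 | 1 |
| 4 | boar | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 4 | 1 | 0 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 96 | wallaby | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 2 | 1 | 0 | 1 | 1 |
| 97 | wasp | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 6 | 0 | 0 | 0 | 6 |
| 98 | wolf | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 4 | 1 | 0 | 1 | 1 |
| 99 | worm | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 7 |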


Write code to create a list of column names from `zoo_df`:

In [16]:
columns = list(zoo_df.columns)
print(columns)

['animal', 'hair', 'feathers', 'eggs', 'milk', 'airbourne', 'aquatic', 'predator', 'toothed', 'backbone', 'breathes', 'venomous', 'fins', 'legs', 'tail', 'domestic', 'catsize', 'type']


Write code to return the data type for each column in `zoo_df`:

In [17]:
zoo_df.dtypes

animal       object
hair          int64
feathers      int64
eggs          int64
milk          int64
airbourne     int64
aquatic       int64
predator      int64
toothed       int64
backbone      int64
breathes      int64
venomous      int64
fins          int64
legs          int64
tail          int64
domestic      int64
catsize       int64
type          int64
dtype: object