![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

# Module 3 Unit 2 - Importing Data Sets

While we looked at a number of different data structures in the last units, a lot of the data sets used in data science are organized as two-dimensional **arrays**—like a spreadsheet or table with multiple rows and columns.

Because DataFrames store data in exactly this row-and-column form, our focus will be on ways to create DataFrames from data sets.


### Adding data to a DataFrame

There are lots of ways to add data to DataFrames. Which method is best for a project depends on how much data there is and where we are getting it from.

Three common methods we'll explore are:

1. **Creating the DataFrame by hand** inside a Jupyter notebook—for small data sets

1. **Loading the DataFrame from a file in the Jupyter notebook**—for small to medium data sets

1. **Loading the DataFrame from an online resource**—useful for large data sets or when accessing secondary data sets from online repositories or open data portals, such as [Stats Canada](https://www150.statcan.gc.ca/n1/en/type/data) or The [Global Health Observatory](https://www.who.int/data/gho)
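
As a rough preview of the three methods listed above, the sketch below builds the same small table by hand and from CSV text. The `io.StringIO` object and the file name `coins.csv` are stand-ins chosen for illustration, not files provided with this course.

```python
import io
import pandas as pd

# Method 1: type the values in by hand (small data sets).
by_hand_df = pd.DataFrame({'name': ['nickel', 'dime'], 'value': [5, 10]})

# Method 2: load from a file. io.StringIO stands in for a real CSV file here;
# with a real file we would pass its path instead, e.g. pd.read_csv('coins.csv')
# (a hypothetical file name).
csv_text = "name,value\nnickel,5\ndime,10\n"
from_file_df = pd.read_csv(io.StringIO(csv_text))

# Method 3: pd.read_csv also accepts a URL string in place of a file path.
# It is not run here because it needs an internet connection.

print(by_hand_df)
print(from_file_df)
```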
    

### Creating a DataFrame by hand

*Get the most out of this section by opening a Jupyter notebook in another window and following along. Code snippets provided in the course can be pasted directly into your Jupyter notebook. Review Module 2, Unit 5 for a refresher on creating and opening Jupyter notebooks in Callysto.*

As we saw earlier in the course, DataFrames organize data into rows and columns.

For small data sets, like our table of Canadian coins worth less than a dollar, we can type the data points directly into our code by hand. This is similar to what we did in the DataFrame tutorial and activity in Unit 1.

An important step when constructing a DataFrame is creating our **dictionary.** A dictionary is the code that establishes each column's header and list of values under that header.

For example, in the code below we recreate our `coins_df` DataFrame.

    import pandas as pd
    from pandas import DataFrame
    # Each key becomes a column header; each list holds that column's values.
    data = {'name': ['penny', 'nickel', 'dime', 'quarter'],
            'value': [1, 5, 10, 25],
            'weight': [2.35, 3.95, 1.75, 4.4],
            'design': ['Maple Leaves', 'Beaver', 'Schooner', 'Caribou']}
    coins_df = DataFrame(data)
    coins_df
    
![dataframewithpandas](../_images/Module3-Unit2-image.png)

*The output of the completed DataFrame. Entering `coins_df` in a cell displays the table of data for the penny, nickel, dime, and quarter.*

The output is a DataFrame which holds all the data in a nicely organized format.

The far-left column is the DataFrame's row index. By default, it is a column of numbers counting up from 0. However, we can supply our own row labels instead by passing an `index` argument, as in the following line:

    DataFrame(data, index=['a', 'b', 'c', 'd'])
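
To use one of the existing columns as the row index, pandas provides the `set_index` method. The following is a minimal sketch using a shortened version of the coins data; the variable name `coins_by_name` is chosen here for illustration.

```python
import pandas as pd

data = {'name': ['penny', 'nickel', 'dime', 'quarter'],
        'value': [1, 5, 10, 25]}
coins_df = pd.DataFrame(data)

# set_index returns a new DataFrame whose row labels come from the 'name' column.
coins_by_name = coins_df.set_index('name')

# Rows can now be looked up by coin name instead of by number.
print(coins_by_name.loc['dime', 'value'])  # prints 10
```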
    
So, if you need to quickly analyze a small amount of data, it's easy to type the data directly into the Jupyter notebook. For larger amounts of data, you might need to transfer data from another format directly into a DataFrame.

For example, another way to create a DataFrame from data stored in code is to start with a numerical array, such as one we might get from Numerical Python, or **NumPy**.

### 🏷️ Key Term: NumPy
>Numerical Python (NumPy) is a library of code functions and data structures that are useful for a wide variety of numerical calculations. NumPy is widely used for scientific computations in Python.

We can convert a **two-dimensional (2D) array** directly into a DataFrame by passing the array to `DataFrame`, like this:

![NumPy](../_images/Module3-Unit2-image1.png)

*A demonstration of how to create a DataFrame of numerical values from an array built with Numerical Python, also called NumPy.*
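
For readers following along without the image, the conversion might look like the sketch below. The array values and the name `my_array` are assumptions chosen for illustration, not necessarily the ones shown in the screenshot.

```python
import numpy as np
from pandas import DataFrame

# A 2D array with 4 rows and 3 columns (illustrative values).
my_array = np.array([[1, 1, 1],
                     [2, 4, 8],
                     [3, 9, 27],
                     [4, 16, 64]])

# Passing the array to DataFrame gives numbered rows and columns by default.
array_df = DataFrame(my_array)
print(array_df)
```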

### 🏷️ Key Term: Array
>An array is 2D when the values are organized into rows and columns, like a table on a 2D piece of paper. Arrays can also be 1D or even 3D, but we won't be exploring those in this course.

As we can see, the column headers and row index are numbers. To make the DataFrame more readable, we can give them labels by passing `columns` and `index` arguments to `DataFrame`.

![Array](../_images/Module3-Unit2-image2.png)

*A demonstration of how to create a pandas DataFrame from arrays and assign column names and row index names. Each array represents one row, so four arrays give four rows. Each array holds three values, one per column. Correspondingly, we add three column names ('base', 'square', 'cube') and four row names ('one', 'two', 'three', 'four').*
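
A sketch of the labelled version follows. The column and row names come from the caption above, while the numeric values (the numbers 1 to 4 with their squares and cubes) are an assumption made for illustration.

```python
import numpy as np
from pandas import DataFrame

# Each inner list is one row: a base number, its square, and its cube.
my_array = np.array([[1, 1, 1],
                     [2, 4, 8],
                     [3, 9, 27],
                     [4, 16, 64]])

# columns labels the three columns; index labels the four rows.
array_df = DataFrame(my_array,
                     columns=['base', 'square', 'cube'],
                     index=['one', 'two', 'three', 'four'])
print(array_df)
```

With labels in place, a single value can be looked up by name, for example `array_df.loc['three', 'cube']`.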

2D NumPy arrays aren't the only data structures that can be converted into a DataFrame. Others include:

* A 2D NumPy Masked array of numbers (which includes masked values)
* A dict of arrays, lists, or tuples (each dict item becomes a column)
* A dict of Series, or a dict of dicts
* A list of Series, or a list of dicts
* A list of lists, or a list of tuples
* Another DataFrame

When converting another data structure into a DataFrame, the syntax is the same as when we converted our 2D NumPy array, as in this line:

    array_df = DataFrame(myData)

where `myData` is one of the above data structures (array, dict, list, etc.) and `array_df` is the DataFrame that is created from that data.
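
As an illustration, the sketch below passes three of the structures listed above to `DataFrame`. The variable names and the two-coin data are made up for this example.

```python
import pandas as pd
from pandas import DataFrame

# A list of dicts: each dict becomes one row, and the keys become column headers.
list_of_dicts = [{'name': 'nickel', 'value': 5},
                 {'name': 'dime', 'value': 10}]
df_from_dicts = DataFrame(list_of_dicts)

# A list of lists: each inner list becomes one row, so we supply the headers.
list_of_lists = [['nickel', 5], ['dime', 10]]
df_from_lists = DataFrame(list_of_lists, columns=['name', 'value'])

# A dict of Series: each Series becomes one column.
dict_of_series = {'name': pd.Series(['nickel', 'dime']),
                  'value': pd.Series([5, 10])}
df_from_series = DataFrame(dict_of_series)

print(df_from_dicts)
print(df_from_lists)
print(df_from_series)
```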

Creating DataFrames by hand works well for small amounts of data. With large data sets, however, typing every value in by hand quickly becomes impractical.

[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)