# Creating, Loading, and Selecting Data with Pandas

## Create
***

### Create a DataFrame I

A DataFrame is an object that stores data as rows and columns. You can think of a DataFrame as a spreadsheet or as a SQL table. You can manually create a DataFrame or fill it with data from a CSV, an Excel spreadsheet, or a SQL query.

You can pass in a dictionary to `pd.DataFrame()`. Each key is a column name and each value is a list of column values. The columns must all be the same length or you will get an error.

In [8]:
import pandas as pd

In [20]:
df1 = pd.DataFrame({
    'Product ID': [1, 2, 3, 4],
    'Product Name': ['t-shirt', 't-shirt', 'skirt', 'skirt'],
    'Color': ['blue', 'green', 'red', 'black']
})
#df1
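
The equal-length requirement can be checked directly. The cell below is a minimal sketch, not part of the original notebook: `bad_data` is a made-up dictionary whose `'Color'` list is one value short, and the sketch assumes pandas reports the mismatch as a `ValueError`.

```python
import pandas as pd

# Hypothetical data: 'Color' has three values while the other columns have four.
bad_data = {
    'Product ID': [1, 2, 3, 4],
    'Product Name': ['t-shirt', 't-shirt', 'skirt', 'skirt'],
    'Color': ['blue', 'green', 'red'],
}

try:
    pd.DataFrame(bad_data)
    error_message = None
except ValueError as err:
    # pandas refuses to build a DataFrame from columns of unequal length.
    error_message = str(err)

print(error_message)
```

If the lists are brought back to the same length, the same call should build the DataFrame as in `df1`.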

### Create a DataFrame II

You can also create a DataFrame from lists.

For example, you can pass in a list of lists, where each inner list represents a row of data. Use the keyword argument `columns` to pass a list of column names.

In [19]:
df2 = pd.DataFrame(
    [[1, 'Paris', 100], [2, 'NYC', 70], [3, 'Rome', 120], [4, 'Aspen', 88]],
    columns=['Store Id', 'Location', 'Number of Employees'])
#df2.set_index('Store Id')
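
To see how the row-oriented and column-oriented forms relate, here is a small sketch that builds the same store data both ways and compares the results. The names `rows`, `from_rows`, `from_dict`, and `indexed` are illustrative and do not appear in the original notebook.

```python
import pandas as pd

columns = ['Store Id', 'Location', 'Number of Employees']

# Row-oriented: each inner list is one row.
rows = [[1, 'Paris', 100], [2, 'NYC', 70], [3, 'Rome', 120], [4, 'Aspen', 88]]
from_rows = pd.DataFrame(rows, columns=columns)

# Column-oriented: each key is a column name, each value a list of column values.
from_dict = pd.DataFrame({
    'Store Id': [1, 2, 3, 4],
    'Location': ['Paris', 'NYC', 'Rome', 'Aspen'],
    'Number of Employees': [100, 70, 120, 88],
})

# Both constructions should describe the same table.
same = from_rows.equals(from_dict)
print(same)

# set_index returns a new DataFrame that uses 'Store Id' as the row labels;
# from_rows itself is left unchanged.
indexed = from_rows.set_index('Store Id')
print(indexed)
```

Which form to use is mostly a matter of how the data arrives: records read one at a time fit the list-of-lists form, while data already grouped by field fits the dictionary form.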

## Loading
***

### Loading and Saving CSVs
When you have data in a CSV, you can load it into a DataFrame in Pandas using `pd.read_csv()`:

In [22]:
#pd.read_csv('my-csv-file.csv')

In the example above, the `pd.read_csv()` function is called. The name of the CSV file, `my-csv-file.csv`, is passed in as an argument, and the function returns a DataFrame.

We can also **save data to a CSV**, using `.to_csv()`.

In [24]:
#df.to_csv('new-csv-file.csv')

In the example above, the `.to_csv()` method is called on `df` (which represents a DataFrame object), and the name of the CSV file, `new-csv-file.csv`, is passed in as an argument. Because the name is a relative path, the file is saved in your current working directory.
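
The sketch below round-trips a small DataFrame through a CSV file so that both calls can be run without an existing file. The data and the temporary directory are assumptions made for illustration. It also shows that `.to_csv()` writes the row index by default, which is one possible reason a loaded file ends up with a column named `Unnamed: 0`.

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row DataFrame used only for this demonstration.
df = pd.DataFrame({'id': [1, 2], 'name': ['Avatar', 'Jurassic World']})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'new-csv-file.csv')

    # Default behaviour: the row index is written as an extra, unnamed first column.
    df.to_csv(path)
    with_index = pd.read_csv(path)

    # index=False leaves the row index out of the file.
    df.to_csv(path, index=False)
    without_index = pd.read_csv(path)

print(list(with_index.columns))     # ['Unnamed: 0', 'id', 'name']
print(list(without_index.columns))  # ['id', 'name']
```

Passing `index=False` is a reasonable choice when the index is just the default 0, 1, 2, ... numbering and carries no information of its own.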

### Inspect a DataFrame

When we load a new DataFrame from a CSV, we want to know what it looks like.

If it’s a **small DataFrame**, you can display it by typing `print(df)`.

If it’s a larger DataFrame, it’s helpful to be able to inspect a few items without having to look at the entire DataFrame.

The method `.head()` gives the **first 5 rows of a DataFrame**. If you want to see a different number of rows, you can pass it in as the argument `n`. For example, `df.head(10)` would show the first 10 rows.

The method `df.info()` prints a summary of the DataFrame: the number of rows, and for each column its name, its count of non-null values, and its data type.

In [31]:
df = pd.read_csv('imdb.csv')
df.head(7)
#df.info()

| Unnamed: 0 | id | name | genre | year | imdb_rating |
| --- | --- | --- | --- | --- | --- |
| 0 | 1 | Avatar | action | 2009 | 7.9 |
| 1 | 2 | Jurassic World | action | 2015 | 7.3 |
| 2 | 3 | The Avengers | action | 2012 | 8.1 |
| 3 | 4 | The Dark Knight | action | 2008 | 9.0 |
| 4 | 5 | Star Wars: Episode I - The Phantom Menace | action | 1999 | 6.6 |
| 5 | 6 | Star Wars | action | 1977 | 8.7 |
| 6 | 7 | Avengers: Age of Ultron | action | 2015 | 7.9 |
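
Since `imdb.csv` itself is not included here, the sketch below rebuilds some of the columns shown above by hand so the inspection methods can be tried without the file. The variable name `movies` is illustrative.

```python
import pandas as pd

# Columns copied from the output above, not loaded from imdb.csv.
movies = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6, 7],
    'name': ['Avatar', 'Jurassic World', 'The Avengers', 'The Dark Knight',
             'Star Wars: Episode I - The Phantom Menace', 'Star Wars',
             'Avengers: Age of Ultron'],
    'year': [2009, 2015, 2012, 2008, 1999, 1977, 2015],
    'imdb_rating': [7.9, 7.3, 8.1, 9.0, 6.6, 8.7, 7.9],
})

first_five = movies.head()     # no argument: the first 5 rows
first_three = movies.head(3)   # explicit n: the first 3 rows

print(len(first_five), len(first_three))

# info() prints its summary directly instead of returning a DataFrame.
movies.info()
```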


## Selecting
***

### Select Columns


Suppose you have the DataFrame called `customers`, which contains the ages of your customers:

| name | age |
| --- | --- |
| Rebecca Erikson | 35 | 
| Thomas Roberson | 28 |
| Diane Ochoa | 42 |

Perhaps you want to take the average or plot a histogram of the ages. In order to do either of these tasks, you’d need to select the column.

There are two possible syntaxes for selecting all values from a column:

* Select the column as if you were **selecting a value from a dictionary using a key**. In our example, we would type `customers['age']` to select the ages.


* If the name of a column follows all of the rules for a variable name (doesn’t start with a number, doesn’t contain spaces or special characters, etc.), then you can select it using the following notation: `df.MySecondColumn`. In our example, we would type **`customers.age`**.
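
Both syntaxes can be tried on the `customers` table above. The cell below is a small sketch that rebuilds that table by hand. The variable names `ages_by_key` and `ages_by_attribute` are illustrative.

```python
import pandas as pd

# The customers table from above, rebuilt by hand.
customers = pd.DataFrame({
    'name': ['Rebecca Erikson', 'Thomas Roberson', 'Diane Ochoa'],
    'age': [35, 28, 42],
})

ages_by_key = customers['age']      # dictionary-style selection
ages_by_attribute = customers.age   # attribute-style selection

# Both forms return the same column.
print(ages_by_key.equals(ages_by_attribute))  # True

# With the column selected, taking the average is a single method call.
print(ages_by_key.mean())  # 35.0
```

The bracket form works for any column name, including names with spaces such as `'Product Name'`, so it is the safer default. The attribute form is only a convenience for names that are valid Python identifiers.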


In [33]:
df = pd.DataFrame(
    [['January', 100, 100, 23, 100], ['February', 51, 45, 145, 45],
     ['March', 81, 96, 65, 96], ['April', 80, 80, 54, 180],
     ['May', 51, 54, 54, 154], ['June', 112, 109, 79, 129]],
    columns=[
        'month', 'clinic_east', 'clinic_north', 'clinic_south', 'clinic_west'
    ])

clinic_north = df['clinic_north']
clinic_north

0    100
1     45
2     96
3     80
4     54
5    109
Name: clinic_north, dtype: int64