# Creating, Loading, and Selecting Data with Pandas

## Create
***

### Create a DataFrame I

A DataFrame is an object that stores data as rows and columns. You can think of a DataFrame as a spreadsheet or as a SQL table. You can manually create a DataFrame or fill it with data from a CSV, an Excel spreadsheet, or a SQL query.

You can pass in a dictionary to `pd.DataFrame()`. Each key is a column name and each value is a list of column values. The columns must all be the same length or you will get an error.

In [19]:
import pandas as pd

In [97]:
df1 = pd.DataFrame({
    'Product ID': [1, 2, 3, 4],
    'Product Name': ['t-shirt', 't-shirt', 'skirt', 'skirt'],
    'Color': ['blue', 'green', 'red', 'black']
})
#df1
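
The length rule above can be checked with a small sketch. The data here is made up for illustration: one column has three values and the other only two, and pandas is expected to raise a `ValueError`.

```python
import pandas as pd

# Hypothetical data: 'Color' has two values, 'Product ID' has three.
try:
    pd.DataFrame({
        'Product ID': [1, 2, 3],
        'Color': ['blue', 'green'],
    })
    error_raised = False
except ValueError as err:
    error_raised = True
    print('Could not build the DataFrame:', err)
```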

### Create a DataFrame II

You can also add data using lists.

For example, you can pass in a list of lists, where each inner list represents a row of data. Use the keyword argument `columns` to pass a list of column names.

In [21]:
df2 = pd.DataFrame(
    [[1, 'Paris', 100], [2, 'NYC', 70], [3, 'Rome', 120], [4, 'Aspen', 88]],
    columns=['Store Id', 'Location', 'Number of Employees'])
#df2.set_index('Store Id')

## Loading
***

### Loading and Saving CSVs
When you have data in a CSV file, you can load it into a DataFrame using the Pandas function `pd.read_csv()`:

In [22]:
#pd.read_csv('my-csv-file.csv')

In the example above, the `pd.read_csv()` function is called with the name of the CSV file, `my-csv-file.csv`, as its argument.

We can also **save data to a CSV**, using `.to_csv()`.

In [23]:
#df.to_csv('new-csv-file.csv')

In the example above, the `.to_csv()` method is called on df (which represents a DataFrame object). The name of the CSV file is passed in as an argument (new-csv-file.csv). By default, this method will save the CSV file in your current directory.
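
As a rough illustration of saving and loading together, the sketch below writes a small DataFrame to CSV text and reads it back. The in-memory `io.StringIO` buffer and the sample data are assumptions made so the example runs without touching the disk; a file name would work the same way.

```python
import io
import pandas as pd

df = pd.DataFrame({
    'Store Id': [1, 2],
    'Location': ['Paris', 'NYC'],
})

# An in-memory text buffer stands in for a file on disk.
buffer = io.StringIO()
df.to_csv(buffer, index=False)  # index=False leaves the row labels out

buffer.seek(0)  # rewind to the start before reading
loaded = pd.read_csv(buffer)
print(loaded)
```

Without `index=False`, the row labels are written as an extra unnamed column, which `pd.read_csv()` then loads as a column called `Unnamed: 0`.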

### Inspect a DataFrame

#### Head()

The method **`.head()`** gives the **first 5 rows of a DataFrame**. If you want to see more rows, you can pass in the positional argument n. For example, `df.head(10)` would show the first 10 rows.

The method `df.info()` prints a summary of each column: its name, its data type, and its number of non-null values.

In [24]:
df = pd.read_csv('imdb.csv')
df.head(7)
#df.info()

| | id | name | genre | year | imdb_rating |
| --- | --- | --- | --- | --- | --- |
| 0 | 1 | Avatar | action | 2009 | 7.9 |
| 1 | 2 | Jurassic World | action | 2015 | 7.3 |
| 2 | 3 | The Avengers | action | 2012 | 8.1 |
| 3 | 4 | The Dark Knight | action | 2008 | 9.0 |
| 4 | 5 | Star Wars: Episode I - The Phantom Menace | action | 1999 | 6.6 |
| 5 | 6 | Star Wars | action | 1977 | 8.7 |
| 6 | 7 | Avengers: Age of Ultron | action | 2015 | 7.9 |
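
The inspection methods can also be tried without `imdb.csv`. The sketch below uses a made-up eight-row DataFrame (the name `movies` and its values are hypothetical) so that the default of five rows is visible.

```python
import pandas as pd

# Hypothetical data: eight rows, more than the default head() size.
movies = pd.DataFrame({
    'id': range(1, 9),
    'rating': [7.9, 7.3, 8.1, 9.0, 6.6, 8.7, 7.9, 8.4],
})

first_five = movies.head()   # default: first 5 rows
first_two = movies.head(2)   # explicit row count

print(first_five)
print(first_two)
movies.info()                # prints column names, non-null counts, dtypes
```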


## Selecting
***

### Select Columns


Suppose you have a DataFrame called `customers`, which contains the ages of your customers:

|name |age | 
| --- | --- |
| Rebecca Erikson | 35 | 
| Thomas Roberson | 28 |
| Diane Ochoa | 42 |

Perhaps you want to take the average or plot a histogram of the ages. In order to do either of these tasks, you’d need to select the column.

There are two possible syntaxes for selecting all values from a column:

* Select the column as if you were **selecting a value from a dictionary using a key**. In our example, we would type `customers['age']` to select the ages.


* If the name of a column follows all of the rules for a variable name (doesn’t start with a number, doesn’t contain spaces or special characters, etc.), then you can select it using the following notation: `df.MySecondColumn`. In our example, we would type **`customers.age`**.


In [75]:
df = pd.DataFrame(
    [['January', 100, 100, 23, 100], ['February', 51, 45, 145, 45],
     ['March', 81, 96, 65, 96], ['April', 80, 80, 54, 180],
     ['May', 51, 54, 54, 154], ['June', 112, 109, 79, 129]],
    columns=[
        'month', 'clinic_east', 'clinic_north', 'clinic_south', 'clinic_west'
    ])

clinic_north = df['clinic_north']
clinic_north2 = df.clinic_north
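
Returning to the `customers` example, a minimal sketch of both syntaxes might look like this. It rebuilds the table shown earlier and takes the average age; the variable names are chosen only for illustration.

```python
import pandas as pd

customers = pd.DataFrame({
    'name': ['Rebecca Erikson', 'Thomas Roberson', 'Diane Ochoa'],
    'age': [35, 28, 42],
})

ages_by_key = customers['age']   # bracket notation works for any column name
ages_by_attr = customers.age     # attribute notation needs a valid identifier

print(ages_by_key.equals(ages_by_attr))
print(ages_by_key.mean())
```

For a column name that contains a space, such as `'Product Name'`, only the bracket notation is available.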

### Selecting Multiple Columns

To select two or more columns from a DataFrame, we use a list of the column names. To select `clinic_east` and `clinic_west` into a new DataFrame, we would use:



In [36]:
multiple_columns = df[['clinic_east', 'clinic_west']]
#multiple_columns
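
One detail worth seeing in code is the difference between single and double brackets. In this sketch, with a shortened version of the clinic data, a single column name gives a Series, while a list of names gives a DataFrame, even when the list has only one name.

```python
import pandas as pd

df = pd.DataFrame({
    'month': ['January', 'February'],
    'clinic_east': [100, 51],
    'clinic_west': [100, 45],
})

one_column = df['clinic_east']                     # Series
two_columns = df[['clinic_east', 'clinic_west']]   # DataFrame
still_a_frame = df[['clinic_east']]                # one-column DataFrame

print(type(one_column).__name__)
print(type(two_columns).__name__)
print(type(still_a_frame).__name__)
```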

### Select Rows


DataFrames are zero-indexed, meaning that we start with the 0th row and count up from there. We want to select 'March', which is the third row and therefore has index 2.

We select it using the following command:


In [53]:
march = df.iloc[2]
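
A single row selected with `.iloc[]` comes back as a Series whose labels are the column names. The sketch below uses a shortened version of the clinic data to show how individual values can then be read from the row.

```python
import pandas as pd

df = pd.DataFrame({
    'month': ['January', 'February', 'March', 'April'],
    'clinic_east': [100, 51, 81, 80],
})

march = df.iloc[2]   # position 2 is the third row

print(march['month'])
print(march['clinic_east'])
```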

### Select Multiple Rows


Here are some different ways of selecting multiple rows:

In [69]:
#print(df.iloc[:4])   # rows 0 up to, but not including, 4

In [68]:
#print(df.iloc[1:5])  # rows 1 up to, but not including, 5

In [67]:
#print(df.iloc[-3:])  # the last 3 rows of the DataFrame
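
To make the slice boundaries concrete, this sketch applies the same three slices to a six-row DataFrame that holds only the month names and lists which months each slice keeps.

```python
import pandas as pd

df = pd.DataFrame({
    'month': ['January', 'February', 'March', 'April', 'May', 'June'],
})

first_four = df.iloc[:4]    # positions 0, 1, 2, 3
middle = df.iloc[1:5]       # positions 1, 2, 3, 4
last_three = df.iloc[-3:]   # positions 3, 4, 5

print(first_four['month'].tolist())
print(middle['month'].tolist())
print(last_three['month'].tolist())
```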

### Select Rows with Logic I

You can select a subset of a DataFrame by using logical statements:

In [72]:
april_reports = df[df.month == 'April']
april_reports

| | month | clinic_east | clinic_north | clinic_south | clinic_west |
| --- | --- | --- | --- | --- | --- |
| 3 | April | 80 | 80 | 54 | 180 |
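
It can help to look at the logical statement on its own. In this sketch, with a shortened version of the clinic data, the comparison produces a Series of `True`/`False` values, one per row, and indexing the DataFrame with that Series keeps only the `True` rows. The name `is_april` is chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'month': ['January', 'February', 'March', 'April'],
    'clinic_east': [100, 51, 81, 80],
})

is_april = df.month == 'April'   # boolean Series, one value per row
print(is_april.tolist())

april_reports = df[is_april]     # keeps only the rows where the mask is True
print(april_reports)
```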


### Select Rows with Logic II

You can also combine multiple logical statements with `&` (and) or `|` (or), as long as each statement is in parentheses.

Syntax: `df[(df.age < 30) | (df.name == 'Martha Jones')]`

In [83]:
clinic_east_top_reports = df[(df.clinic_east > 80) & (df.clinic_east < 150)]
clinic_east_top_reports

| | month | clinic_east | clinic_north | clinic_south | clinic_west |
| --- | --- | --- | --- | --- | --- |
| 0 | January | 100 | 100 | 23 | 100 |
| 2 | March | 81 | 96 | 65 | 96 |
| 5 | June | 112 | 109 | 79 | 129 |
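
The syntax line above uses `|`, while the clinic example uses `&`. A small sketch with made-up customer data (all names and ages here are hypothetical) shows both operators side by side.

```python
import pandas as pd

# Hypothetical customer data for illustration.
customers = pd.DataFrame({
    'name': ['Martha Jones', 'Rory Shah', 'Ana Lopez'],
    'age': [41, 25, 33],
})

# | means OR: keep a row if either condition holds.
young_or_martha = customers[
    (customers['age'] < 30) | (customers['name'] == 'Martha Jones')
]

# & means AND: keep a row only if both conditions hold.
thirties = customers[(customers['age'] >= 30) & (customers['age'] < 40)]

print(young_or_martha['name'].tolist())
print(thirties['name'].tolist())
```

The parentheses matter because `&` and `|` bind more tightly than comparison operators such as `<` and `==` in Python, so without them the expression is grouped differently from what was intended.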


### Select Rows with Logic III

Suppose we want to select the rows where the month is either “January”, “April” or “June”.

In [87]:
df[df.month.isin(['January','April','June'])]

| | month | clinic_east | clinic_north | clinic_south | clinic_west |
| --- | --- | --- | --- | --- | --- |
| 0 | January | 100 | 100 | 23 | 100 |
| 3 | April | 80 | 80 | 54 | 180 |
| 5 | June | 112 | 109 | 79 | 129 |
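
For comparison, the sketch below writes the same selection twice, once with `.isin()` and once as a chain of `|` conditions, using a DataFrame that holds only the month names. Both should select the same rows; `.isin()` is simply shorter when there are many values.

```python
import pandas as pd

df = pd.DataFrame({
    'month': ['January', 'February', 'March', 'April', 'May', 'June'],
})

with_isin = df[df.month.isin(['January', 'April', 'June'])]
with_or = df[(df.month == 'January') | (df.month == 'April') | (df.month == 'June')]

print(with_isin.equals(with_or))
```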


## Setting indices
***

When we select a subset of a DataFrame using logic, we end up with non-consecutive indices. This is inelegant, and the index labels no longer match the row positions used by `.iloc[]`.

We can fix this using the method `.reset_index()`.

For example: 

In [90]:
new_df = df[df.month.isin(['January', 'April', 'June'])]
new_df

| | month | clinic_east | clinic_north | clinic_south | clinic_west |
| --- | --- | --- | --- | --- | --- |
| 0 | January | 100 | 100 | 23 | 100 |
| 3 | April | 80 | 80 | 54 | 180 |
| 5 | June | 112 | 109 | 79 | 129 |


Here, `new_df` is a DataFrame with non-consecutive indices (0, 3, and 5).

If we use the command `new_df.reset_index()`, **we get a new DataFrame with a new set of indices**, and the old indices are kept in a column called `index`:

In [91]:
new_df.reset_index()

| | index | month | clinic_east | clinic_north | clinic_south | clinic_west |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | January | 100 | 100 | 23 | 100 |
| 1 | 3 | April | 80 | 80 | 54 | 180 |
| 2 | 5 | June | 112 | 109 | 79 | 129 |


To avoid the extra `index` column created by `.reset_index()`, pass the parameter `drop=True`:

In [95]:
new_df.reset_index(drop=True)

| | month | clinic_east | clinic_north | clinic_south | clinic_west |
| --- | --- | --- | --- | --- | --- |
| 0 | January | 100 | 100 | 23 | 100 |
| 1 | April | 80 | 80 | 54 | 180 |
| 2 | June | 112 | 109 | 79 | 129 |


**Using `.reset_index()` will return a new DataFrame**, but we often want to modify our existing DataFrame instead. Passing the keyword argument `inplace=True` modifies the existing DataFrame rather than returning a new one.

In [93]:
df.reset_index(drop=True, inplace=True)
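
The two ways of keeping the reset indices can be compared in a short sketch. The variable names (`subset`, `reassigned`, `in_place`, `result`) are made up for illustration, and the explicit `.copy()` is an assumption added so the in-place call works on an independent DataFrame.

```python
import pandas as pd

df = pd.DataFrame({'month': ['January', 'February', 'March', 'April']})
subset = df[df.month.isin(['January', 'April'])]

# Option 1: reassign the DataFrame returned by reset_index().
reassigned = subset.reset_index(drop=True)

# Option 2: modify a copy in place; the call itself returns None.
in_place = subset.copy()
result = in_place.reset_index(drop=True, inplace=True)

print(reassigned.index.tolist())
print(in_place.index.tolist())
print(result)
```

Because the in-place call returns `None`, writing `df = df.reset_index(drop=True, inplace=True)` would replace the DataFrame with `None`; use either reassignment or `inplace=True`, not both.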