# Data Processing in Pandas - Notes  

*This is the first part of a two-part series: Part 1, [Part 2](https://www.kaggle.com/ctxplorer/data-processing-in-pandas-ii)*

#### Content:
1. Creating a DataFrame
2. Loading and saving CSVs
3. Inspecting a DataFrame
4. Selecting columns
5. Selecting rows 
6. Selecting rows with logical conditions
7. Resetting indices 

In [1]:
import pandas as pd

## 1. Creating a DataFrame

#### Add data using a dictionary

In [2]:
df1 = pd.DataFrame({
    'name': ['John Smith', 'Jane Doe', 'Joe Schmo'],
    'address': ['123 Main St.', '456 Maple Ave.', '789 Broadway'],
    'age': [34, 28, 51]
})
print(df1)

         name         address  age
0  John Smith    123 Main St.   34
1    Jane Doe  456 Maple Ave.   28
2   Joe Schmo    789 Broadway   51


##### The columns appear in the order of the dictionary keys, as the output above shows. Older setups (before Python 3.6 and pandas 0.23) sorted them alphabetically, because dictionaries had no guaranteed order.
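To check how your own installation orders the columns of a dictionary-built DataFrame, here is a minimal sketch. The toy data and the names `people` and `reordered` are made up for illustration; passing `columns=` is one way to fix the order explicitly.

```python
import pandas as pd

# Hypothetical toy data; the keys are deliberately not in alphabetical order.
data = {
    'name': ['John Smith', 'Jane Doe'],
    'address': ['123 Main St.', '456 Maple Ave.'],
    'age': [34, 28],
}

people = pd.DataFrame(data)
print(list(people.columns))      # order pandas chose for the dict keys

# columns= selects the dict keys and sets their order explicitly.
reordered = pd.DataFrame(data, columns=['age', 'name', 'address'])
print(list(reordered.columns))
```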

#### Add data using a list of lists

In [3]:
df2 = pd.DataFrame([
    ['John Smith', '123 Main St.', 34],
    ['Jane Doe', '456 Maple Ave.', 28],
    ['Joe Schmo', '789 Broadway', 51]
    ],
    columns=['name', 'address', 'age'])
print(df2)

         name         address  age
0  John Smith    123 Main St.   34
1    Jane Doe  456 Maple Ave.   28
2   Joe Schmo    789 Broadway   51


## 2. Loading and saving CSVs

In [4]:
# save data to a CSV
df1.to_csv('new-csv-file.csv')

# load CSV file into a DataFrame in Pandas
df3 = pd.read_csv('../input/sample-csv-file/sample.csv')

print(df3)

            City  Population  Median Age
0      Maplewood      100000          40
1          Wayne      350000          33
2  Forrest Hills      300000          35
3        Paramus      400000          55
4     Hackensack      290000          39
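One detail worth knowing about `to_csv`: by default it also writes the row index as an unnamed first column, which comes back as an extra column when the file is reloaded. The sketch below shows the difference with `index=False`. It uses an in-memory string instead of a file, and the `cities` data is a hypothetical two-row sample.

```python
import io
import pandas as pd

# Hypothetical sample data for illustration.
cities = pd.DataFrame({
    'City': ['Maplewood', 'Wayne'],
    'Population': [100000, 350000],
})

# With no path argument, to_csv returns the CSV text as a string.
with_index = cities.to_csv()                # index written as first column
without_index = cities.to_csv(index=False)  # index left out

# Read both versions back from memory.
reloaded_default = pd.read_csv(io.StringIO(with_index))
reloaded_clean = pd.read_csv(io.StringIO(without_index))

print(list(reloaded_default.columns))  # includes an 'Unnamed: 0' column
print(list(reloaded_clean.columns))    # only the original columns
```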


## 3. Inspecting a DataFrame

In [5]:
df4 = pd.read_csv('../input/imdb-data/IMDB-Movie-Data.csv')

# print the first 3 rows of the DataFrame (default is 5)
print(df4.head(3))

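# print a summary of each column (non-null count and dtype);
# info() prints directly and returns None, so print() also emits a trailing "None"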
print(df4.info())

   Rank                    Title    ...    Revenue (Millions) Metascore
0     1  Guardians of the Galaxy    ...                333.13      76.0
1     2               Prometheus    ...                126.46      65.0
2     3                    Split    ...                138.12      62.0

[3 rows x 12 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 12 columns):
Rank                  1000 non-null int64
Title                 1000 non-null object
Genre                 1000 non-null object
Description           1000 non-null object
Director              1000 non-null object
Actors                1000 non-null object
Year                  1000 non-null int64
Runtime (Minutes)     1000 non-null int64
Rating                1000 non-null float64
Votes                 1000 non-null int64
Revenue (Millions)    872 non-null float64
Metascore             936 non-null float64
dtypes: float64(3), int64(4), object(5)
memory usage: 93.8+ KB
None
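`info()` reports counts and dtypes rather than statistics. For actual summary statistics of the numeric columns, `describe()` is the usual companion. This is a minimal sketch on a hypothetical three-row frame, not the IMDB data.

```python
import pandas as pd

# Hypothetical toy frame with one missing Metascore value.
movies = pd.DataFrame({
    'Title': ['A', 'B', 'C'],
    'Runtime (Minutes)': [121, 124, 117],
    'Metascore': [76.0, None, 62.0],
})

print(movies.shape)        # (number of rows, number of columns)

# describe() summarises numeric columns only by default;
# 'count' excludes missing values.
stats = movies.describe()
print(stats)
```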


## 4. Selecting columns

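#### Using the column name as an attribute
##### This works only if the column name is a valid Python identifier (no spaces or special characters) and does not clash with an existing DataFrame attribute or method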

In [6]:
# Select column 'Title'
imdb_title = df4.Title
print(imdb_title.head())

0    Guardians of the Galaxy
1                 Prometheus
2                      Split
3                       Sing
4              Suicide Squad
Name: Title, dtype: object
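The limits of attribute access are easy to see on a small example. In this sketch the frame and its `count` column are hypothetical; `count` is chosen because it is also the name of a DataFrame method.

```python
import pandas as pd

# Hypothetical frame: one plain name, one name with spaces,
# and one name that clashes with the DataFrame.count method.
movies = pd.DataFrame({
    'Title': ['Sing', 'Split'],
    'Runtime (Minutes)': [108, 117],
    'count': [1, 2],
})

# Works: 'Title' is a valid identifier and not an existing attribute.
titles = movies.Title

# movies.count resolves to the method, not the column.
print(callable(movies.count))

# Bracket access returns the column regardless of its name.
counts = movies['count']
runtimes = movies['Runtime (Minutes)']
print(counts.tolist(), runtimes.tolist())
```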


#### Using the column name as a key

In [7]:
# Select column 'Runtime (Minutes)'
imdb_runtime_minutes = df4['Runtime (Minutes)']
print(imdb_runtime_minutes.head())

0    121
1    124
2    117
3    108
4    123
Name: Runtime (Minutes), dtype: int64


#### Selecting multiple columns

In [8]:
imdb_data = df4[['Title', 'Runtime (Minutes)']]
print(imdb_data.head())

                     Title  Runtime (Minutes)
0  Guardians of the Galaxy                121
1               Prometheus                124
2                    Split                117
3                     Sing                108
4            Suicide Squad                123


## 5. Selecting rows

In [9]:
# select the fourth row (position 3, counting from 0)
sing_movie = imdb_data.iloc[3]
print(sing_movie)

Title                Sing
Runtime (Minutes)     108
Name: 3, dtype: object


#### Selecting multiple rows

In [10]:
# select the last three rows
last_three_movies = imdb_data.iloc[-3:]
print(last_three_movies)

                      Title  Runtime (Minutes)
997  Step Up 2: The Streets                 98
998            Search Party                 93
999              Nine Lives                 87
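`iloc` selects by position, not by index label. The two coincide for a default index, but not once the labels differ. The sketch below contrasts `iloc` with the label-based `loc`, which these notes do not otherwise cover. The frame and its index labels are hypothetical.

```python
import pandas as pd

# Hypothetical frame whose index labels are not 0, 1, 2.
movies = pd.DataFrame(
    {'Title': ['Sing', 'Split', 'Prometheus']},
    index=[10, 20, 30],
)

first_by_position = movies.iloc[0]   # row at position 0
row_by_label = movies.loc[20]        # row whose index label is 20
last_two = movies.iloc[-2:]          # last two rows by position

print(first_by_position['Title'])
print(row_by_label['Title'])
print(last_two)
```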


## 6. Selecting rows with logical conditions

In [11]:
# select rows with runtime less than 75
short_movies = imdb_data[imdb_data['Runtime (Minutes)'] < 75]
print(short_movies)

                       Title  Runtime (Minutes)
42   Don't Fuck in the Woods                 73
793      Ma vie de Courgette                 66
819       Wolves at the Door                 73


#### Selecting rows with multiple logical conditions
##### Use parentheses around each condition when combining multiple logical conditions

In [12]:
# select rows with runtime strictly between 60 and 80
medium_length_movies = imdb_data[(imdb_data['Runtime (Minutes)'] > 60) &
                                 (imdb_data['Runtime (Minutes)'] < 80)]
print(medium_length_movies)

                       Title  Runtime (Minutes)
42   Don't Fuck in the Woods                 73
793      Ma vie de Courgette                 66
819       Wolves at the Door                 73
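The parentheses matter because `&`, `|` and `~` bind more tightly than comparison operators such as `>` and `<`. Without them the expression is parsed differently and typically raises an error. The same pattern extends to "or" and "not", as in this sketch on a hypothetical four-row frame.

```python
import pandas as pd

# Hypothetical toy data.
movies = pd.DataFrame({
    'Title': ['A', 'B', 'C', 'D'],
    'Runtime (Minutes)': [55, 66, 73, 121],
})
runtime = movies['Runtime (Minutes)']

# AND: both conditions must hold.
mask = (runtime > 60) & (runtime < 80)
medium = movies[mask]

# OR: at least one condition must hold.
short_or_long = movies[(runtime < 60) | (runtime > 100)]

# NOT: invert a boolean mask.
not_medium = movies[~mask]

print(medium['Title'].tolist())
print(short_or_long['Title'].tolist())
print(not_medium['Title'].tolist())
```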


#### Selecting rows with specific values

In [13]:
# select rows with title in the list
fav_movies = imdb_data[imdb_data.Title.isin([
    'Wolves at the Door', 'Guardians of the Galaxy'
])]
print(fav_movies)

                       Title  Runtime (Minutes)
0    Guardians of the Galaxy                121
819       Wolves at the Door                 73


## 7. Resetting indices  
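##### When we select a subset of a DataFrame using a logical condition, we end up with non-consecutive indices. We can fix this using the method **.reset_index()**. Passing `drop=True` discards the old indices instead of keeping them as a new column.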

In [14]:
# reset_index returns a new DataFrame; reassign it to keep the result
fav_movies = fav_movies.reset_index(drop=True)
print(fav_movies)

# reset indices in place, modifying medium_length_movies itself
medium_length_movies.reset_index(drop=True, inplace=True)
print(medium_length_movies)

                     Title  Runtime (Minutes)
0  Guardians of the Galaxy                121
1       Wolves at the Door                 73
                     Title  Runtime (Minutes)
0  Don't Fuck in the Woods                 73
1      Ma vie de Courgette                 66
2       Wolves at the Door                 73
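To see what `drop=True` changes, compare it with the default behaviour, which keeps the old labels in a new column named `index`. This is a minimal sketch on hypothetical data.

```python
import pandas as pd

# Hypothetical toy data.
movies = pd.DataFrame({
    'Title': ['A', 'B', 'C'],
    'Runtime (Minutes)': [121, 73, 66],
})

# Filtering keeps the original labels, here 1 and 2.
short = movies[movies['Runtime (Minutes)'] < 100]

kept = short.reset_index()              # old labels become an 'index' column
dropped = short.reset_index(drop=True)  # old labels are discarded

print(kept)
print(dropped)
```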


### That is all for now. Hope it helped you!
#### Check out [Part 2](https://www.kaggle.com/ctxplorer/data-processing-in-pandas-ii) of the series.