## Pandas

Pandas is a Python library used for data manipulation and analysis. It provides the data structures (chiefly the Series and the DataFrame) and the functions needed to work with structured data.

### Two main data structures in Pandas

- **Series**: A one-dimensional labeled array.
- **DataFrame**: A two-dimensional labeled data structure with columns of potentially different types.


A **DataFrame** is a two-dimensional, size-mutable, and potentially heterogeneous table with labeled axes (rows and columns). It is the central data structure in pandas, and most of the operations in this section work on it.

### Key Features of a DataFrame:
- **Two-Dimensional:** Data is organized in rows and columns, much like a table in a database or an Excel spreadsheet.
- **Labeled Axes:** Both rows and columns have labels, which are used for accessing and modifying data.
- **Heterogeneous Data:** Different columns can hold different data types (e.g., integers in one column, floats in another, strings in a third).
- **Size-Mutable:** You can add or remove rows and columns as needed.
- **Missing Data Handling:** DataFrames have built-in methods for dealing with missing data, such as NaN values.


In [5]:
import pandas as pd
data = {'Name': ['Alice', 'Bob'], 'Age': [25, 30]}
df = pd.DataFrame(data)
print(df)

    Name  Age
0  Alice   25
1    Bob   30
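
The features listed above can be checked on the same small table. The snippet below is a minimal sketch: the `City` column and its values are made up for illustration and are not part of the original example.

```python
import pandas as pd

people = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30]})

# Heterogeneous: each column carries its own dtype.
print(people.dtypes)

# Size-mutable: add a (hypothetical) column; None marks a missing value.
people["City"] = ["Paris", None]

# Missing-data handling: count missing values per column.
print(people.isna().sum())

# Remove the column again; drop() returns a new DataFrame.
people_no_city = people.drop(columns="City")
print(people_no_city.shape)
```

Note that `drop()` leaves `people` itself unchanged unless the result is assigned back.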


### Common Operations:

- **Selection:** Accessing specific rows and columns using labels or index positions.
- **Filtering:** Applying conditions to filter rows.
- **Aggregation:** Performing statistical operations like sum, mean, etc., on data.
- **Merging/Joining:** Combining multiple DataFrames.
- **Reshaping:** Changing the structure of the DataFrame (e.g., pivoting).
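
Reshaping is the one operation in this list that is not demonstrated later, so here is a minimal sketch of pivoting. The `sales` table, its column names, and its numbers are invented purely for illustration.

```python
import pandas as pd

# Hypothetical long-format data: one row per (Region, Quarter) pair.
sales = pd.DataFrame({
    "Region": ["North", "North", "South", "South"],
    "Quarter": ["Q1", "Q2", "Q1", "Q2"],
    "Revenue": [100, 120, 90, 110],
})

# Pivot to wide format: one row per Region, one column per Quarter.
wide = sales.pivot(index="Region", columns="Quarter", values="Revenue")
print(wide)
```

`pivot()` expects each index/column pair to appear only once; when pairs repeat, `pivot_table()` with an aggregation function is the usual alternative.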


The examples that follow use a movie dataset loaded from a CSV file.

In [6]:
# Path to the uploaded CSV file
file_path = '/kaggle/input/movies/Hydra-Movie-Scrape.csv'

# Load the CSV file into a DataFrame
df = pd.read_csv(file_path)

# Display the DataFrame
print(df.head())

# By default, head() returns the first 5 rows.



                                               Title  Year  \
0                        Patton Oswalt: Annihilation  2017   
1                                      New York Doll  2005   
2  Mickey's Magical Christmas: Snowed in at the H...  2001   
3                         Mickey's House of Villains  2001   
4                                      And Then I Go  2017   

                                             Summary  \
0  Patton Oswald, despite a personal tragedy, pro...   
1  A recovering alcoholic and recently converted ...   
2  After everyone is snowed in at the House of Mo...   
3  The villains from the popular animated Disney ...   
4  In the cruel world of junior high, Edwin suffe...   

                                       Short Summary  \
0  Patton Oswalt, despite a personal tragedy, pro...   
1  A recovering alcoholic and recently converted ...   
2  Mickey and all his friends hold their own Chri...   
3  The villains from the popular animated Disney ...   
4  In the 

Display all column names

In [7]:
column_names = df.columns
print(column_names)

Index(['Title', 'Year', 'Summary', 'Short Summary', 'Genres', 'IMDB ID',
       'Runtime', 'YouTube Trailer', 'Rating', 'Movie Poster', 'Director',
       'Writers', 'Cast'],
      dtype='object')


In [8]:
num_rows, num_columns = df.shape
print(f'Number of rows: {num_rows}')
print(f'Number of columns: {num_columns}')


Number of rows: 3940
Number of columns: 13


Filter movies released after 2017

In [9]:
filtered_df = df[df['Year'] > 2017]
print(filtered_df)


                           Title  Year  \
6                   Peter Rabbit  2018   
10               Forever My Girl  2018   
11       Tom Segura: Disgraceful  2018   
15    Suicide Squad: Hell to Pay  2018   
16                      Wildling  2018   
...                          ...   ...   
3923                   Hurricane  2018   
3924         Destination Wedding  2018   
3934             Office Uprising  2018   
3935                  Skyscraper  2018   
3939                         UFO  2018   

                                                Summary  \
6     Based on the books by Beatrix Potter: Peter Ra...   
10    After being gone for a decade a country star r...   
11    Comedian Tom Segura rants about funny things a...   
15    Task Force X targets a powerful mystical objec...   
16    Anna spends her entire childhood under the car...   
...                                                 ...   
3923  1940. Great Britain stands alone in Europe aga...   
3924  The story of two 

Selections in Pandas refer to the various ways you can access, filter, or subset data within a DataFrame or Series. Here’s an overview of the different selection techniques:

### 1\. **Selecting Columns**

*   Single column

In [10]:
df['Title'].head(10)  # Returns a Series; head(10) limits the output to the first 10 rows

0                          Patton Oswalt: Annihilation
1                                        New York Doll
2    Mickey's Magical Christmas: Snowed in at the H...
3                           Mickey's House of Villains
4                                        And Then I Go
5                             An Extremely Goofy Movie
6                                         Peter Rabbit
7                                           Love Songs
8                                                   89
9                                       The Foster Boy
Name: Title, dtype: object

* Multiple Columns

In [11]:
df[['Title', 'Year']].head(10)  # Returns a DataFrame


                                               Title  Year
0                        Patton Oswalt: Annihilation  2017
1                                      New York Doll  2005
2  Mickey's Magical Christmas: Snowed in at the H...  2001
3                         Mickey's House of Villains  2001
4                                      And Then I Go  2017
5                           An Extremely Goofy Movie  2000
6                                       Peter Rabbit  2018
7                                         Love Songs  2007
8                                                 89  2017
9                                     The Foster Boy  2011


### 2\. **Selecting Rows**

*   By Index

In [12]:
df.loc[15]  # Selects a row by its index label


Title                                     Suicide Squad: Hell to Pay
Year                                                            2018
Summary            Task Force X targets a powerful mystical objec...
Short Summary      Task Force X targets a powerful mystical objec...
Genres                                              Action|Animation
IMDB ID                                                    tt7167602
Runtime                                                           86
YouTube Trailer                                          EPZZvk-wbGE
Rating                                                           7.2
Movie Poster       https://hydramovies.com/wp-content/uploads/201...
Director                                                     Sam Liu
Writers                                                 Alan Burnett
Cast                               Christian Slater|Vanessa Williams
Name: 15, dtype: object
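
`.loc` selects by index label, while `.iloc` selects by integer position. The two only differ visibly when the labels are not simply 0, 1, 2, ..., so the sketch below uses a small made-up table with labels 10, 20, 30 (the titles and years are placeholders, not rows from the movie dataset).

```python
import pandas as pd

# Hypothetical table whose index labels differ from the row positions.
films = pd.DataFrame(
    {"Title": ["Film A", "Film B", "Film C"], "Year": [2001, 2005, 2018]},
    index=[10, 20, 30],
)

by_label = films.loc[20]       # row whose index label is 20
by_position = films.iloc[0]    # first row, whatever its label is

label_slice = films.loc[10:20]   # label slices include both endpoints
pos_slice = films.iloc[0:1]      # position slices exclude the stop position

print(by_label["Title"], by_position["Title"])
print(len(label_slice), len(pos_slice))
```

In the movie dataset the default index runs from 0 upward, so `df.loc[15]` and `df.iloc[15]` happen to return the same row there.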

### 3\. **Selecting Rows and Columns Together**

In [13]:
a = df.loc[15, 'Title']  # Select a single cell by row label and column name
b = df.loc[15, ['Title', 'Year']]  # Select specific row and multiple columns
c = df.loc[:, ['Title', 'Year']].head(10)  # Select all rows but specific columns

print(a)
print("**********************")
print(b)
print("**********************")
print(c)

Suicide Squad: Hell to Pay
**********************
Title    Suicide Squad: Hell to Pay
Year                           2018
Name: 15, dtype: object
**********************
                                               Title  Year
0                        Patton Oswalt: Annihilation  2017
1                                      New York Doll  2005
2  Mickey's Magical Christmas: Snowed in at the H...  2001
3                         Mickey's House of Villains  2001
4                                      And Then I Go  2017
5                           An Extremely Goofy Movie  2000
6                                       Peter Rabbit  2018
7                                         Love Songs  2007
8                                                 89  2017
9                                     The Foster Boy  2011


### 4\. **Boolean Indexing (Conditional Selection)**

*   Select Rows based on a condition



In [14]:
df[df['Year'] > 2015].head(10)  # Select rows where the condition is True


Unnamed: 0,Title,Year,Summary,Short Summary,Genres,IMDB ID,Runtime,YouTube Trailer,Rating,Movie Poster,Director,Writers,Cast
0,Patton Oswalt: Annihilation,2017,"Patton Oswald, despite a personal tragedy, pro...","Patton Oswalt, despite a personal tragedy, pro...",Uncategorized,tt7026230,66,4hZi5QaMBFc,7.4,https://hydramovies.com/wp-content/uploads/201...,Bobcat Goldthwait,Patton Oswalt,Patton Oswalt
4,And Then I Go,2017,"In the cruel world of junior high, Edwin suffe...","In the cruel world of junior high, Edwin suffe...",Drama,tt2018111,99,8CdIiD6-iF0,7.6,https://hydramovies.com/wp-content/uploads/201...,Vincent Grashaw,Brett Haley,Arman Darbo|Sawyer Barth
6,Peter Rabbit,2018,Based on the books by Beatrix Potter: Peter Ra...,Feature adaptation of Beatrix Potter's classic...,Adventure|Animation|Comedy|Family|Fantasy,tt5117670,95,7Pa_Weidt08,6.6,https://hydramovies.com/wp-content/uploads/201...,Will Gluck,Rob Lieber,Fayssal Bazzi|James Corden
8,89,2017,89 tells the incredible story of one of footba...,"The true story of a sporting miracle, when Ars...",Uncategorized,tt7614404,91,5hfAExhHTMM,8.1,https://hydramovies.com/wp-content/uploads/201...,Dave Stewart,Lee Dixon,Ian Wright
10,Forever My Girl,2018,After being gone for a decade a country star r...,After being gone for a decade a country star r...,Drama|Music|Romance,tt4103724,108,3vqcMr1q5Uc,6.4,https://hydramovies.com/wp-content/uploads/201...,Bethany Ashton Wolf,Bethany Ashton Wolf,Abby Ryder Fortson|Alex Roe|Jessica Rothe
11,Tom Segura: Disgraceful,2018,Comedian Tom Segura rants about funny things a...,Comedian Tom Segura rants about funny things a...,Comedy|Documentary,tt7379330,0,kYYINJM3lPA,7.5,https://hydramovies.com/wp-content/uploads/201...,Jay Karas,Tom Segura,Tom Segura
14,Silent Night,2017,Adam unexpectedly visits his family house at C...,Adam unexpectedly visits his family house at C...,Comedy|Drama,tt7133554,100,cA6BUYVkQoE,7.5,https://hydramovies.com/wp-content/uploads/201...,Piotr Domalewski,Piotr Domalewski,Agnieszka Suchora|Dawid Ogrodnik|Tomasz Zietek
15,Suicide Squad: Hell to Pay,2018,Task Force X targets a powerful mystical objec...,Task Force X targets a powerful mystical objec...,Action|Animation,tt7167602,86,EPZZvk-wbGE,7.2,https://hydramovies.com/wp-content/uploads/201...,Sam Liu,Alan Burnett,Christian Slater|Vanessa Williams
16,Wildling,2018,Anna spends her entire childhood under the car...,A blossoming teenager uncovers the dark secret...,Fantasy|Horror,tt5085924,92,eyl1Wf90AgY,6.1,https://hydramovies.com/wp-content/uploads/201...,Fritz Böhm,Fritz Böhm,Bel Powley|Brad Dourif|Liv Tyler
17,The Humanity Bureau,2017,A dystopian thriller set in the year 2030 that...,A dystopian thriller set in the year 2030 that...,Action|Sci-Fi,tt6143568,95,kUH8JGhRzPY,6.1,https://hydramovies.com/wp-content/uploads/201...,Rob W. King,Dave Schultz,Jakob Davies|Nicolas Cage|Sarah Lind


* Multiple Conditions

In [15]:
df[(df['Year'] > 2015) & (df['Rating'] > 8)]  # Using & (and)
# df[(df['Year'] > 2015) | (df['Rating'] < 8)]  # Using | (or)


Unnamed: 0,Title,Year,Summary,Short Summary,Genres,IMDB ID,Runtime,YouTube Trailer,Rating,Movie Poster,Director,Writers,Cast
8,89,2017,89 tells the incredible story of one of footba...,"The true story of a sporting miracle, when Ars...",Uncategorized,tt7614404,91,5hfAExhHTMM,8.1,https://hydramovies.com/wp-content/uploads/201...,Dave Stewart,Lee Dixon,Ian Wright
21,Andre the Giant,2018,A look at the life and career of professional ...,A look at the life and career of professional ...,Documentary,tt6543420,85,f_jTeuajas0,8.2,https://hydramovies.com/wp-content/uploads/201...,Jason Hehir,Robin Wright,Cary Elwes
125,Chasing Coral,2017,Coral reefs around the world are vanishing at ...,Coral reefs around the world are vanishing at ...,Documentary,tt6333054,93,b6fHA9R2cKI,8.1,https://hydramovies.com/wp-content/uploads/201...,Jeff Orlowski,Davis Coombe,Andrew Ackerman|Pim Bongaerts
127,The Farthest,2017,Is it humankind's greatest achievement? 12 bil...,It is one of humankind's greatest achievements...,Documentary|History,tt6223974,121,znTdk_de_K8,8.1,https://hydramovies.com/wp-content/uploads/201...,Emer Reynolds,Emer Reynolds,Carolyn Porco|Frank Drake|John Casani
186,Your Name,2016,Mitsuha is the daughter of the mayor of a smal...,Two strangers find themselves linked in a biza...,Animation|Drama|Fantasy|Romance,tt5311514,106,VgixlvX28-g,8.4,https://hydramovies.com/wp-content/uploads/201...,Makoto Shinkai,Makoto Shinkai,Mone Kamishiraishi|Ryûnosuke Kamiki
192,"Three Billboards Outside Ebbing, Missouri",2017,"THREE BILLBOARDS OUTSIDE EBBING, MISSOURI is a...",A mother personally challenges the local autho...,Crime|Drama,tt5027774,115,Jit3YhGx5pU,8.2,https://hydramovies.com/wp-content/uploads/201...,Martin McDonagh,Martin McDonagh,Frances McDormand|Sam Rockwell|Woody Harrelson
221,Coco,2017,Despite his family's baffling generations-old ...,"Aspiring musician Miguel, confronted with his ...",Adventure|Animation|Comedy|Family|Fantasy,tt2380307,0,6Zxj9q8Yjdw,8.5,https://hydramovies.com/wp-content/uploads/201...,Lee Unkrich,Lee Unkrich,Anthony Gonzalez|Gael García Bernal
238,Cuba and the Cameraman,2017,Life in Cuba for three struggling families ove...,Life in Cuba for three struggling families ove...,Documentary,tt7320560,113,lsZ8hDutkeM,8.2,https://hydramovies.com/wp-content/uploads/201...,Jon Alpert,Jon Alpert,
260,Blade Runner 2049,2017,Thirty years after the events of the first fil...,A young blade runner's discovery of a long-bur...,Drama|Mystery|Sci-Fi|Thriller,tt1856101,164,gCcx85zbxz4,8.1,https://hydramovies.com/wp-content/uploads/201...,Denis Villeneuve,Hampton Fancher,Harrison Ford|Ryan Gosling
439,Logan,2017,In 2029 the mutant population has shrunken sig...,"In the near future, a weary Logan cares for an...",Action|Drama|Sci-Fi|Thriller,tt3315342,137,DekuSxJgpbY,8.1,https://hydramovies.com/wp-content/uploads/201...,James Mangold,James Mangold,Hugh Jackman|Patrick Stewart


### 5\. **Selecting With Query**


In [16]:
# Filter titles that start with the letter 'P'
# df_filtered = df[df['Title'].str[0] == 'P']

# Keep only the Title, Year, and Rating columns rather than whole rows
df_filtered = df.query("Title.str[0] == 'P'")[['Title', 'Year', 'Rating']].head(10)
print(df_filtered)

                           Title  Year  Rating
0    Patton Oswalt: Annihilation  2017     7.4
6                   Peter Rabbit  2018     6.6
32             Perfect Strangers  2017     7.0
33                       Paterno  2018     6.6
41            Petals on the Wind  2014     6.3
66                Phantom Thread  2017     7.7
115                      Prodigy  2017     6.6
116              Pitch Perfect 3  2017     6.0
119              Pan's Labyrinth  2006     8.2
124                 Paddington 2  2017     8.0


In [17]:
# Combine multiple conditions with 'and'
df_filtered = df.query("Title.str[0] == 'P' and Rating > 7")[['Title','Rating']].head(10)
print(df_filtered)

                                                 Title  Rating
0                          Patton Oswalt: Annihilation     7.4
66                                      Phantom Thread     7.7
119                                    Pan's Labyrinth     8.2
124                                       Paddington 2     8.0
244  Phineas and Ferb the Movie: Across the 2nd Dim...     7.5
255             Professor Marston and the Wonder Women     7.1
432                                   Peaceful Warrior     7.3
472                                           Paterson     7.4
475                                       Patriots Day     7.4
654                            Pelé: Birth of a Legend     7.2
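
Two further `query()` features are often useful alongside the examples above: referring to Python variables with `@`, and quoting column names that contain spaces with backticks. The sketch below assumes a tiny stand-in table; its titles, ratings, and IDs are placeholders, not values from the movie dataset.

```python
import pandas as pd

# Hypothetical stand-in for the movie DataFrame.
movies = pd.DataFrame({
    "Title": ["Film A", "Film B", "Film C"],
    "Year": [2016, 2017, 2018],
    "Rating": [8.4, 6.6, 7.2],
    "IMDB ID": ["id-1", "id-2", "id-3"],
})

min_rating = 7

# '@' pulls in a variable from the surrounding Python scope.
recent_good = movies.query("Year > 2015 and Rating > @min_rating")

# Backticks let query() refer to a column whose name contains a space.
one_film = movies.query("`IMDB ID` == 'id-2'")

print(recent_good[["Title", "Rating"]])
print(one_film[["Title", "IMDB ID"]])
```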


### 6\. **Selecting Specific Columns Using Column Name Patterns**

*   **Using .filter()**

In [18]:
df.filter(like='er')  # Selects columns whose names contain the substring 'er'
# df.filter(regex='regex_pattern')  # Selects columns based on a regular expression


      YouTube Trailer                                       Movie Poster                  Writers
0         4hZi5QaMBFc  https://hydramovies.com/wp-content/uploads/201...            Patton Oswalt
1         jwD04NsnLLg  https://hydramovies.com/wp-content/uploads/201...              Arthur Kane
2         uCKwHHftrU4  https://hydramovies.com/wp-content/uploads/201...              Thomas Hart
3         JA03ciYt-Ek  https://hydramovies.com/wp-content/uploads/201...              Thomas Hart
4         8CdIiD6-iF0  https://hydramovies.com/wp-content/uploads/201...              Brett Haley
...               ...                                                ...                      ...
3935      t9QePUT-Yt8  https://hydramovies.com/wp-content/uploads/201...  Rawson Marshall Thurber
3936      bVDGukfxFAk  https://hydramovies.com/wp-content/uploads/201...                Matt Booi
3937      bv0Eh2VhTTA  https://hydramovies.com/wp-content/uploads/201...            Joseph Nasser
3938      8TKLR1_JVLU  https://hydramovies.com/wp-content/uploads/201...             Mark Zakarin
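
Besides `like=`, `.filter()` also accepts `items=` for exact column names and `regex=` for regular-expression matches. The sketch below uses a one-row placeholder table that borrows a few column names from the movie dataset; the row values are invented.

```python
import pandas as pd

columns = ["Title", "Year", "Summary", "Short Summary", "Genres", "Rating"]
# Hypothetical single-row table; only the column names matter here.
movies = pd.DataFrame(
    [["Film A", 2017, "long text", "short text", "Drama", 7.0]],
    columns=columns,
)

exact = movies.filter(items=["Title", "Year"])       # exact column names
summaries = movies.filter(like="Summary")            # names containing 'Summary'
plural = movies.filter(regex=r"s$")                  # names ending in 's'
numeric = movies.filter(regex=r"^(Year|Rating)$")    # either of two full names

print(list(summaries.columns))
print(list(plural.columns))
```

`.filter()` matches on column labels by default, not on cell values; row filtering by value is covered in the next section.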


### Filtering in Pandas: Overview and Methods

**Filtering** in Pandas involves selecting a subset of rows from a DataFrame based on specific conditions. It’s a fundamental operation for data analysis, allowing you to isolate the data you need for further processing or analysis.

**Methods**:

*   **Single Condition Filtering**:
    
    *   Select rows based on a condition applied to one column (e.g., all rows where a column value is above a threshold).
        
*   **Multiple Condition Filtering**:
    
    *   Combine conditions using logical operators like AND (&), OR (|), and NOT (~) to refine the selection.
        
*   **String-Based Filtering**:
    
    *   Filter rows based on string patterns, such as whether a string starts with, ends with, or contains a certain substring.
        
*   **Filtering with .isin()**:
    
    *   Select rows where a column’s value is in a list of specific values.
        
*   **Range Filtering with .between()**:
    
    *   Filter rows where a column’s value lies within a specific range.
        
*   **SQL-like Filtering with query()**:
    
    *   Use a more readable, SQL-like syntax to filter DataFrames based on conditions.
        
*   **Handling Missing Data**:
    
    *   Filter out rows with missing data using .isnull(), .notnull(), and .dropna().
        
*   **Filtering Duplicates**:
    
    *   Identify and filter out duplicate rows based on one or more columns.
        
*   **Top/Bottom N Filtering**:
    
    *   Select the largest or smallest N values in a column using .nlargest() and .nsmallest().
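
The methods listed above can be sketched on a small table. The data below is invented for illustration (including one duplicate row and one missing rating), so the variable names and values are assumptions rather than part of the movie dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a duplicate row and a missing rating.
movies = pd.DataFrame({
    "Title": ["Film A", "Film B", "Film B", "Picture C", "Picture D"],
    "Year": [2001, 2017, 2017, 2018, 2010],
    "Rating": [6.5, 8.1, 8.1, np.nan, 7.2],
})

in_years = movies[movies["Year"].isin([2017, 2018])]           # .isin()
mid_rated = movies[movies["Rating"].between(6, 7.5)]           # inclusive range
pictures = movies[movies["Title"].str.startswith("Picture")]   # string-based
not_recent = movies[~(movies["Year"] > 2015)]                  # NOT with ~
rated = movies.dropna(subset=["Rating"])                       # drop missing ratings
unique = movies.drop_duplicates()                              # drop repeated rows
top2 = unique.nlargest(2, "Rating")                            # top N by a column

print(top2[["Title", "Rating"]])
```

Each condition must be wrapped in parentheses when combined with `&`, `|`, or `~`, because these operators bind more tightly than comparisons such as `>`.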

### Grouping in Pandas

**Grouping** in Pandas allows you to aggregate and analyze data by grouping it based on one or more columns. This is useful for summarizing data and performing calculations on subsets of your dataset.

#### Key Points:

1.  **Key Functions**:
    
    *   **groupby()**: Main function to group data by one or more columns.
        
    *   **Aggregation Functions**: Functions like sum(), mean(), count(), min(), max(), etc., to calculate statistics on each group.
        
    *   **Transformation Functions**: Functions like transform() and apply() to perform operations on each group.
        
2.  **Common Operations**:
    
    *   **Grouping by Columns**: Group data based on values in one or more columns.
        
    *   **Applying Aggregations**: Calculate statistics like mean, sum, or count for each group.
        
    *   **Filtering Groups**: Apply conditions to keep or drop whole groups based on aggregate results.

In [19]:
# Dummy Data
data = {
    'Department': ['HR', 'HR', 'Finance', 'Finance', 'IT', 'IT', 'IT'],
    'Employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace'],
    'Salary': [70000, 75000, 80000, 85000, 90000, 95000, 100000],
    'Experience': [2, 5, 7, 8, 3, 6, 9]
}

df = pd.DataFrame(data)
print(df)

  Department Employee  Salary  Experience
0         HR    Alice   70000           2
1         HR      Bob   75000           5
2    Finance  Charlie   80000           7
3    Finance    David   85000           8
4         IT      Eve   90000           3
5         IT    Frank   95000           6
6         IT    Grace  100000           9


#### Q1. Group by Department and compute the total salary and experience for each department

In [20]:
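# Note: sum() also concatenates the string column 'Employee' (see the output below).
# Select the numeric columns first, as in Q2, to avoid this.
grouped_df = df.groupby('Department').sum()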
print("\nGrouped DataFrame:")
print(grouped_df)



Grouped DataFrame:
                 Employee  Salary  Experience
Department                                   
Finance      CharlieDavid  165000          15
HR               AliceBob  145000           7
IT          EveFrankGrace  285000          18


#### Q2. Group by Department and compute the mean salary and experience

In [28]:
g = df.groupby('Department')
c = g[['Salary', 'Experience']].mean()
print(c)

             Salary  Experience
Department                     
Finance     82500.0         7.5
HR          72500.0         3.5
IT          95000.0         6.0


#### Q3. Add a new column that shows the mean salary of each department

In [32]:
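# g is the groupby object created in Q2; transform('mean') returns one value
# per original row, so the result lines up with df and can be assigned as a column.
df['Mean_Salary'] = g['Salary'].transform('mean')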
print(df)

  Department Employee  Salary  Experience  Mean_Salary
0         HR    Alice   70000           2      72500.0
1         HR      Bob   75000           5      72500.0
2    Finance  Charlie   80000           7      82500.0
3    Finance    David   85000           8      82500.0
4         IT      Eve   90000           3      95000.0
5         IT    Frank   95000           6      95000.0
6         IT    Grace  100000           9      95000.0
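
The "Filtering Groups" operation listed earlier is not covered by Q1 to Q3, so here is a minimal sketch of it, together with named aggregation via `agg()`. It rebuilds the numeric part of the dummy data under a new name (`staff`) so that it runs on its own; the 80000 threshold is an arbitrary choice for illustration.

```python
import pandas as pd

# Same dummy figures as above, rebuilt so this snippet is self-contained.
staff = pd.DataFrame({
    "Department": ["HR", "HR", "Finance", "Finance", "IT", "IT", "IT"],
    "Salary": [70000, 75000, 80000, 85000, 90000, 95000, 100000],
    "Experience": [2, 5, 7, 8, 3, 6, 9],
})

# Several aggregations at once, each with an explicit output column name.
summary = staff.groupby("Department").agg(
    total_salary=("Salary", "sum"),
    mean_experience=("Experience", "mean"),
    headcount=("Salary", "count"),
)
print(summary)

# Keep only the rows of departments whose mean salary exceeds 80000.
well_paid = staff.groupby("Department").filter(
    lambda grp: grp["Salary"].mean() > 80000
)
print(well_paid)
```

Unlike an aggregation, `filter()` returns the original rows of the surviving groups rather than one row per group.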


### Merging in Pandas

In pandas, merging is the process of combining two or more DataFrames based on a common key or index. It is similar to SQL joins and allows you to integrate data from different sources into a single DataFrame.

#### Key Points:

*   **merge() Function**: The primary function used for merging DataFrames.
    
*   **Keys**: Merging is done based on common columns (keys) or indices.
    
*   **Types of Joins**: Supports different join types such as inner, outer, left, and right joins.
    

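For example, merging can combine employee details from one DataFrame with department information from another based on a shared key. The example below joins on the common 'Department' column and brings in each department's 'Department\_ID' and 'Location'.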

In [34]:
department_data = {
    'Department': ['HR', 'Finance', 'IT'],
    'Department_ID': [10, 20, 30],
    'Location': ['Building A', 'Building B', 'Building C']
}

df_department = pd.DataFrame(department_data)

# Merge the two DataFrames on the 'Department' column
merged_df = pd.merge(df, df_department, on='Department', how='inner')

print(merged_df)


  Department Employee  Salary  Experience  Mean_Salary  Department_ID  \
0         HR    Alice   70000           2      72500.0             10   
1         HR      Bob   75000           5      72500.0             10   
2    Finance  Charlie   80000           7      82500.0             20   
3    Finance    David   85000           8      82500.0             20   
4         IT      Eve   90000           3      95000.0             30   
5         IT    Frank   95000           6      95000.0             30   
6         IT    Grace  100000           9      95000.0             30   

     Location  
0  Building A  
1  Building A  
2  Building B  
3  Building B  
4  Building C  
5  Building C  
6  Building C  


In the above code:
*   **on='Department'**: Specifies that the merging should be done based on the Department column, which is common in both DataFrames.
    
*   **how='inner'**: Specifies an inner join, meaning that only the rows with matching Department values in both DataFrames will be included in the merged result.
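
The difference between the join types only shows up when some keys do not match. The sketch below uses two small made-up tables in which 'Legal' appears only on the employee side and 'IT' only on the department side; the employee 'Heidi' and the 'Legal' department are hypothetical additions for illustration.

```python
import pandas as pd

employees = pd.DataFrame({
    "Employee": ["Alice", "Charlie", "Heidi"],
    "Department": ["HR", "Finance", "Legal"],   # 'Legal' has no match below
})
departments = pd.DataFrame({
    "Department": ["HR", "Finance", "IT"],      # 'IT' has no match above
    "Location": ["Building A", "Building B", "Building C"],
})

# Inner join: only keys present in both tables.
inner = pd.merge(employees, departments, on="Department", how="inner")

# Left join: every employee row is kept; unmatched Location becomes NaN.
left = pd.merge(employees, departments, on="Department", how="left")

# Outer join: all keys from both sides; indicator=True adds a '_merge'
# column saying where each row came from.
outer = pd.merge(employees, departments, on="Department", how="outer",
                 indicator=True)

print(len(inner), len(left), len(outer))
print(outer[["Department", "_merge"]])
```

A right join (`how='right'`) mirrors the left join, keeping every row of the second DataFrame instead.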