# Chapter 1: Introduction to Pandas
* **What is Pandas?**
Pandas is one of the most widely used Python libraries for data analysis and manipulation. It provides data structures and analysis tools that make working with structured (tabular) data intuitive and efficient.
Think of Pandas as Excel on steroids - but with programming power!

- **Core Data Structures**
  - Series (1-dimensional)
  - DataFrame (2-dimensional)

## Series (1-dimensional)
A Series is like a column in a spreadsheet - a one-dimensional array with labels (index).

In [10]:
# Creating Series
import pandas as pd
fruits = pd.Series(['apple','banana','orange'])
fruits

0     apple
1    banana
2    orange
dtype: object

In [12]:
# With Custom Index
prices = pd.Series([1.5, 0.5, 2.0], index=['apple', 'banana', 'orange'])
prices

apple     1.5
banana    0.5
orange    2.0
dtype: float64

In [13]:
# From a dictionary (keys become the index)
scores = pd.Series({'Math': 95, 'English': 87, 'Science': 92})
print(scores)

Math       95
English    87
Science    92
dtype: int64
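The index labels are more than decoration: arithmetic between two Series pairs values by label rather than by position. A minimal sketch, assuming a hypothetical `quantities` Series that is not part of the original notebook:

```python
import pandas as pd

prices = pd.Series([1.5, 0.5, 2.0], index=['apple', 'banana', 'orange'])

# Hypothetical quantities, deliberately listed in a different order
quantities = pd.Series({'orange': 3, 'apple': 2, 'banana': 10})

# Values are matched by label before multiplying, so the order does not matter
totals = prices * quantities
print(totals)
```

Each fruit's price is multiplied by its own quantity even though the two Series list the fruits in different orders.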


### Key Series Properties

In [25]:
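import numpy as np

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(type(s.values))  # underlying NumPy array
print(type(s.index))   # Index object holding the labels
print(s.dtype)
print(s.shape)
print(s)
print(len(s))
arr = np.array(s)      # copy the values into a NumPy array
arr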

<class 'numpy.ndarray'>
<class 'pandas.core.indexes.base.Index'>
int64
(3,)
a    10
b    20
c    30
dtype: int64
3


array([10, 20, 30])
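As an alternative to `np.array(s)`, pandas also offers `Series.to_numpy()` for getting the values out of a Series. A short sketch of that route (the variable names `arr` and `labels` are only illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

arr = s.to_numpy()          # values only; the index labels are dropped
labels = s.index.tolist()   # index labels as a plain Python list

print(arr)
print(labels)
```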

In [28]:
print(s['a'])      # access by label
print(s.iloc[0])   # access by integer position

10
10


In [27]:
print(s['a':'c'])  # label-based slices include the end label

a    10
b    20
c    30
dtype: int64
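The inclusive end point applies only to label-based slicing. Position-based slicing with `.iloc` follows the usual Python rule and excludes the stop position. A small sketch contrasting the two:

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

by_label = s.loc['a':'b']    # label slice: 'b' is included
by_position = s.iloc[0:1]    # position slice: position 1 is excluded

print(by_label)
print(by_position)
```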


## DataFrame (2-dimensional) 📋
A DataFrame is like an entire spreadsheet - a 2D labeled data structure with columns of potentially different types. It's the most important data structure in Pandas!
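One way to picture "columns of potentially different types": each column of a DataFrame is itself a Series with its own dtype. A minimal sketch with made-up sample values:

```python
import pandas as pd

# Hypothetical sample data, used only for illustration
df = pd.DataFrame({
    'Name': ['Alice', 'Bob'],
    'Age': [25, 30],
    'Score': [88.5, 92.0],
})

age = df['Age']              # selecting one column gives back a Series
print(type(age).__name__)
print(df.dtypes)             # one dtype per column
```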

### Creating a DataFrame

In [30]:
# Method 1: From a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 28],
    'City': ['New York', 'Paris', 'London', 'Tokyo'],
    'Salary': [70000, 80000, 75000, 85000]
}
df = pd.DataFrame(data)
print(df)

      Name  Age      City  Salary
0    Alice   25  New York   70000
1      Bob   30     Paris   80000
2  Charlie   35    London   75000
3    David   28     Tokyo   85000


In [32]:
# Method 2: From a list of dictionaries
data = [
    {'Name': 'Alice', 'Age': 25, 'City': 'New York'},
    {'Name': 'Bob', 'Age': 30, 'City': 'Paris'},
    {'Name': 'Charlie', 'Age': 35, 'City': 'London'}
]
df = pd.DataFrame(data)
print(df)

      Name  Age      City
0    Alice   25  New York
1      Bob   30     Paris
2  Charlie   35    London
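With a list of dictionaries, the records do not all need the same keys. As far as I know, pandas takes the union of the keys as columns and fills the gaps with `NaN`. A sketch with one deliberately incomplete record:

```python
import pandas as pd

records = [
    {'Name': 'Alice', 'Age': 25, 'City': 'New York'},
    {'Name': 'Bob', 'Age': 30},   # no 'City' key in this record
]
df = pd.DataFrame(records)

print(df)
print(df['City'].isna())   # True where the value is missing
```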


In [34]:
# Method 3: From a list of lists (need to specify columns)
data = [
    ['Alice', 25, 'New York'],
    ['Bob', 30, 'Paris'],
    ['Charlie', 35, 'London']
]
df = pd.DataFrame(data, columns=['Name', 'Age', 'City'])
print(df)

      Name  Age      City
0    Alice   25  New York
1      Bob   30     Paris
2  Charlie   35    London


In [2]:
# Method 4: From a NumPy array
import numpy as np
import pandas as pd
data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [1, 2, 3], [1, 2, 3]])
df = pd.DataFrame(data, columns=['A', 'B', 'C'])
df

   A  B  C
0  1  2  3
1  4  5  6
2  7  8  9
3  1  2  3
4  1  2  3


### Basic DataFrame Operations

In [14]:
# First and last rows
df.head()          # First 5 rows
df.head(10)        # First 10 rows
df.tail()          # Last 5 rows
df.tail(3)         # Last 3 rows
# Shape and dimensions
df.shape           # Returns: (rows, columns), here (5, 3)
df.size            # Total number of elements
len(df)            # Number of rows
# Column and index information
df.columns         # Column names
df.index           # Index (row labels)
df.dtypes          # Data types of each column
# Summary information
df.info()          # Detailed information about DataFrame
df.describe()      # Statistical summary of numeric columns

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A       5 non-null      int64
 1   B       5 non-null      int64
 2   C       5 non-null      int64
dtypes: int64(3)
memory usage: 252.0 bytes


              A         B         C
count  5.000000  5.000000  5.000000
mean   2.800000  3.800000  4.800000
std    2.683282  2.683282  2.683282
min    1.000000  2.000000  3.000000
25%    1.000000  2.000000  3.000000
50%    1.000000  2.000000  3.000000
75%    4.000000  5.000000  6.000000
max    7.000000  8.000000  9.000000
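A notebook displays only the last expression of a cell, which is why the output above shows just `df.info()` (which prints on its own) and `df.describe()`. To see the other attributes, wrap them in `print`. A sketch that rebuilds the same 5x3 DataFrame:

```python
import numpy as np
import pandas as pd

data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [1, 2, 3], [1, 2, 3]])
df = pd.DataFrame(data, columns=['A', 'B', 'C'])

rows, cols = df.shape
print('shape:', df.shape)           # (rows, columns)
print('size:', df.size)             # rows * columns
print('rows:', len(df))             # same as df.shape[0]
print('columns:', list(df.columns))
```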


### Selecting Columns and Rows

In [18]:
data = [
    ['Alice', 25, 'New York'],
    ['Bob', 30, 'Paris'],
    ['Charlie', 35, 'London']
]
df = pd.DataFrame(data, columns=['Name', 'Age', 'City'])
df.Name
# OR
df['Name']

0      Alice
1        Bob
2    Charlie
Name: Name, dtype: object
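Attribute access (`df.Name`) is handy, but it only works when the column name is a valid Python identifier that does not clash with an existing DataFrame attribute or method. Bracket access always works. A sketch using two hypothetical column names, `'Box Office'` and `'count'`, chosen to show both problems:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob'],
    'Box Office': [12, 76],   # contains a space, so df.Box Office is a syntax error
    'count': [1, 2],          # clashes with the DataFrame.count method
})

print(df['Box Office'])       # bracket access handles any column name
print(df['count'])            # the column
print(callable(df.count))     # df.count is still the method, not the column
```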

In [20]:
df[['Name','Age']]

      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35


In [21]:
df.iloc[:, 0]      # First column

0      Alice
1        Bob
2    Charlie
Name: Name, dtype: object

In [22]:
df.iloc[:, [0, 2]] # First and third columns

      Name      City
0    Alice  New York
1      Bob     Paris
2  Charlie    London


In [23]:
# By position (integer-based)
df.iloc[0]         # First row (returns Series)
df.iloc[0:3]       # First three rows (returns DataFrame)
df.iloc[[0, 2]]    # First and third rows

# By label (works with a custom index too; here the labels are the default 0, 1, 2)
df.loc[0]          # Row with index label 0
df.loc[0:2]        # Rows with index labels 0 to 2 (inclusive!)

# Boolean indexing
df[df['Age'] > 28]                    # Rows where Age > 28
df[df['City'] == 'Paris']             # Rows where City is Paris
df[(df['Age'] > 25) & (df['Age'] < 35)]  # Multiple conditions

   Name  Age   City
1   Bob   30  Paris
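Boolean conditions are combined with `&` (and), `|` (or), and `~` (not). Each condition needs its own parentheses because these operators bind more tightly than comparisons. A sketch of a few more filters on the same data (the variable names are only illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    [['Alice', 25, 'New York'], ['Bob', 30, 'Paris'], ['Charlie', 35, 'London']],
    columns=['Name', 'Age', 'City'],
)

either = df[(df['Age'] < 28) | (df['City'] == 'London')]   # OR
not_paris = df[~(df['City'] == 'Paris')]                   # NOT
in_cities = df[df['City'].isin(['Paris', 'London'])]       # membership test
names_over_28 = df.loc[df['Age'] > 28, 'Name']             # filter rows, pick one column

print(either)
print(not_paris)
print(in_cities)
print(names_over_28)
```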


### 💡 Hands-On Practice

### Example 1: Creating Your First DataFrame

In [29]:
import pandas as pd

# Create a DataFrame of students
students = pd.DataFrame({
    'StudentID': [101, 102, 103, 104, 105],
    'Name': ['Emma', 'Liam', 'Olivia', 'Noah', 'Ava'],
    'Grade': ['A', 'B', 'A', 'C', 'B'],
    'Score': [95, 87, 92, 78, 85],
    'Attendance': [98, 95, 100, 92, 96]
})

print(students)


   StudentID    Name Grade  Score  Attendance
0        101    Emma     A     95          98
1        102    Liam     B     87          95
2        103  Olivia     A     92         100
3        104    Noah     C     78          92
4        105     Ava     B     85          96


In [25]:
print('\n--- DataFrame Info ---')
print('Shape:', students.shape)


--- DataFrame Info ---
Shape: (5, 5)


In [27]:
print('\nColumn Types:')
print(students.dtypes)


Column Types:
StudentID      int64
Name          object
Grade         object
Score          int64
Attendance     int64
dtype: object


In [28]:
print('\nStatistical Summary:')
print(students.describe())


Statistical Summary:
        StudentID      Score  Attendance
count    5.000000   5.000000     5.00000
mean   103.000000  87.400000    96.20000
std      1.581139   6.580274     3.03315
min    101.000000  78.000000    92.00000
25%    102.000000  85.000000    95.00000
50%    103.000000  87.000000    96.00000
75%    104.000000  92.000000    98.00000
max    105.000000  95.000000   100.00000


### Example 2: Basic Data Exploration

In [40]:
# Get statistical summary
print('Average score:', students['Score'].mean())
print('Highest score:', students['Score'].max())
print('Lowest score:', students['Score'].min())
print('Median score:', students['Score'].median())
print('Standard deviation:', students['Score'].std())

# Count values
print('\nGrade distribution:')
print(students['Grade'].value_counts())

# Filter data
high_performers = students[students['Score'] > 90]
print('\nHigh performers (Score > 90):')
print(high_performers)

Average score: 87.4
Highest score: 95
Lowest score: 78
Median score: 87.0
Standard deviation: 6.580273550544841

Grade distribution:
Grade
A    2
B    2
C    1
Name: count, dtype: int64

High performers (Score > 90):
   StudentID    Name Grade  Score  Attendance
0        101    Emma     A     95          98
2        103  Olivia     A     92         100
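The standard deviation printed above is the sample standard deviation: pandas divides by n - 1 by default (`ddof=1`). A short sketch that reproduces the figure by hand and compares it with the population version (`ddof=0`):

```python
import pandas as pd

scores = pd.Series([95, 87, 92, 78, 85])

sample_std = scores.std()              # default ddof=1, divides by n - 1
population_std = scores.std(ddof=0)    # divides by n

# The same sample calculation written out step by step
squared_deviations = (scores - scores.mean()) ** 2
manual_std = (squared_deviations.sum() / (len(scores) - 1)) ** 0.5

print(sample_std)
print(manual_std)
print(population_std)
```

The first two values should agree. The population figure is smaller because it divides by 5 instead of 4.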


In [47]:
students = students.sort_values('Name')
students

   StudentID    Name Grade  Score  Attendance
4        105     Ava     B     85          96
0        101    Emma     A     95          98
1        102    Liam     B     87          95
3        104    Noah     C     78          92
2        103  Olivia     A     92         100
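After sorting, the rows keep their original index labels (4, 0, 1, 3, 2 above). If a fresh 0-based index is preferred, `reset_index(drop=True)` can be chained on. A sketch that ranks the students by score, highest first (`ranked` is just an illustrative name):

```python
import pandas as pd

students = pd.DataFrame({
    'StudentID': [101, 102, 103, 104, 105],
    'Name': ['Emma', 'Liam', 'Olivia', 'Noah', 'Ava'],
    'Grade': ['A', 'B', 'A', 'C', 'B'],
    'Score': [95, 87, 92, 78, 85],
    'Attendance': [98, 95, 100, 92, 96],
})

# Sort by Score in descending order, then renumber the rows from 0
ranked = students.sort_values('Score', ascending=False).reset_index(drop=True)
print(ranked)
```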


### 🎯 Practice Exercises
Try these exercises to solidify your understanding:

**Exercise 1: Movie Database**

Create a DataFrame with at least 5 movies including:

* Title
* Year
* Rating (out of 10)
* Genre
* Box Office (in millions)

Then:

* Display the first 3 rows and last 2 rows
* Find the average rating of all movies
* Select only the 'Title' and 'Rating' columns
* Filter movies with rating > 8.0
* Find the highest-grossing movie

In [75]:
df = pd.DataFrame([
    ['Ask Latften Anlamaz', 2016, 8, 'Romantic', 12],
    ['Marchli', 2021, 8, 'Action', 12],
    ['Hercii', 2019, 9, 'Romantic', 76],
    ['zumrt anqa', 2019, 1, 'Romantic', 99],
    ['Yasak Elma', 2017, 10, 'Romantic', 11],
], columns=['Title', 'Year', 'Rating', 'Genre', 'Box Office'])
# First three rows
print(df.head(3))
print('*'*80)
# Last two rows
print(df.tail(2))
print('*'*80)
# Average rating
print(df['Rating'].mean())
print('*'*80)
# Only the Rating and Title columns
print(df[['Rating', 'Title']])
print('*'*80)
# Movies rated above 8
print(df[df['Rating'] > 8])
print('*'*80)
# Highest-grossing movie
print(df[df['Box Office'] == df['Box Office'].max()])

                 Title  Year  Rating     Genre  Box Office
0  Ask Latften Anlamaz  2016       8  Romantic          12
1              Marchli  2021       8    Action          12
2               Hercii  2019       9  Romantic          76
********************************************************************************
        Title  Year  Rating     Genre  Box Office
3  zumrt anqa  2019       1  Romantic          99
4  Yasak Elma  2017      10  Romantic          11
********************************************************************************
7.2
********************************************************************************
   Rating                Title
0       8  Ask Latften Anlamaz
1       8              Marchli
2       9               Hercii
3       1           zumrt anqa
4      10           Yasak Elma
********************************************************************************
        Title  Year  Rating     Genre  Box Office
2      Hercii  2019       9  Romantic          76
4  Ya