INTRO TO PANDAS FOR DATA ANALYSIS

OBJECTIVES:
1. LEARN WHAT PANDAS SERIES ARE AND HOW TO CREATE THEM
2. ACCESS AND MANIPULATE DATA WITHIN SERIES
3. CREATE AND WORK WITH DATAFRAMES
4. ACCESS, MODIFY AND ANALYZE DATA

What is Pandas?

Pandas is a popular open-source data manipulation and analysis library for the Python programming language. It provides a powerful and flexible set of tools for working with structured data, making it a fundamental tool for data scientists, analysts, and engineers.
Pandas is designed to handle data in various formats, such as tabular data, time series data, and more, making it an essential part of the data processing workflow in many industries.

Here are some key features and functionalities of Pandas:

Data Structures: Pandas offers two primary data structures - DataFrame and Series.

A DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns).
A Series is a one-dimensional labeled array, essentially a single column or row of data.

Data Import and Export: Pandas makes it easy to read data from various sources, including CSV files, Excel spreadsheets, SQL databases, and more. It can also export data to these formats, enabling seamless data exchange.

Data Merging and Joining: You can combine multiple DataFrames using methods like merge and join, similar to SQL operations, to create more complex datasets from different sources.

Efficient Indexing: Pandas provides label-based (.loc) and position-based (.iloc) selection, allowing you to access specific rows and columns of data quickly.

Custom Data Structures: You can create custom data structures and manipulate data in ways that suit your specific needs, extending Pandas' capabilities.
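
To make the merging and joining feature above concrete, here is a minimal sketch. The two tables, their column names, and their values are made up for illustration; only `pd.merge` and its `on` and `how` parameters come from Pandas itself.

```python
import pandas as pd

# Hypothetical example tables; names and values are illustrative only.
people = pd.DataFrame({"person_id": [1, 2, 3],
                       "name": ["Alice", "Bob", "Charlie"]})
cities = pd.DataFrame({"person_id": [1, 2, 4],
                       "city": ["New York", "San Francisco", "Chicago"]})

# Inner join: keep only person_id values that appear in both tables.
inner = pd.merge(people, cities, on="person_id", how="inner")

# Left join: keep every row of `people`; a missing city becomes NaN.
left = pd.merge(people, cities, on="person_id", how="left")

print(inner)
print(left)
```

The `how` argument plays the same role as the join type in SQL, so the choice between "inner" and "left" depends on whether unmatched rows should be kept.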



In [15]:
import pandas as pd 

# Read a CSV file into a DataFrame.
# 'csv_file_path.csv' is a placeholder; replace it with the path to a real file.
df = pd.read_csv('csv_file_path.csv')

FileNotFoundError: [Errno 2] No such file or directory: 'csv_file_path.csv'
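
The FileNotFoundError above appears because the placeholder path does not point to a real file. As a self-contained alternative, the sketch below reads CSV text from memory instead of from disk. The CSV contents and the variable names are assumptions made for illustration.

```python
import io
import pandas as pd

# In-memory text standing in for a CSV file; the contents are made up.
csv_text = "name,age\nAlice,25\nBob,30\n"

# read_csv accepts a file path or any file-like object.
df_csv = pd.read_csv(io.StringIO(csv_text))
print(df_csv)

# to_csv with no path returns the CSV as a string;
# index=False leaves out the row labels.
round_trip = df_csv.to_csv(index=False)
print(round_trip)
```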

<h3 class = "title"> What is a Series? </h3>
<ul> 
    <li> One-dimensional labeled array in Pandas</li>
    <li> 
        Single column of data with labels or indices for each element 
    </li>
    <li> Create a Series from various data sources: lists, NumPy arrays, dictionaries </li>

</ul>

In [None]:
import pandas as pd

# Create a Series from a list
data = [10, 20, 30, 40, 50]
s = pd.Series(data)

print(s)

0    10
1    20
2    30
3    40
4    50
dtype: int64


<h3> In this example: </h3>
<p> We've created a Series named s containing numeric data </p>
<p> Pandas assigned a default integer index (0, 1, 2, ...) to the elements </p>
<p> You can specify custom labels with the index argument if needed</p>
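
<p> A short sketch of custom labels and of the other data sources mentioned earlier (dictionaries and NumPy arrays). The labels and values are illustrative assumptions. </p>

```python
import numpy as np
import pandas as pd

# Custom labels via the index argument (labels are illustrative).
s_labeled = pd.Series([10, 20, 30], index=["a", "b", "c"])

# From a dictionary: the keys become the index labels.
s_from_dict = pd.Series({"x": 1.5, "y": 2.5})

# From a NumPy array: a default integer index is assigned.
s_from_array = pd.Series(np.array([1, 2, 3]))

print(s_labeled)
print(s_from_dict)
print(s_from_array)
```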

<h3>Accessing Elements in a Series</h3>
<p>Access elements in a Series using index labels or integer positions</p>

<h4> 1. Accessing by label</h4>
<h4> 2. Accessing by position</h4>
<h4> 3. Accessing multiple elements</h4>

In [None]:
# 1
print(s[2]) # Access the element with label 2

30


In [None]:
# 2
print(s.iloc[3]) # Access the element at position 3

40


In [None]:
# 3
print(s[1:4]) # Slice by position: positions 1 to 3 (the end, 4, is excluded)

1    20
2    30
3    40
dtype: int64
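
<p> With the default integer index, labels and positions coincide, which hides the difference between them. The sketch below uses string labels (an assumption made for illustration) so that label-based .loc and position-based .iloc can be told apart. </p>

```python
import pandas as pd

s_abc = pd.Series([10, 20, 30, 40, 50], index=["a", "b", "c", "d", "e"])

by_label = s_abc.loc["c"]      # label-based lookup
by_position = s_abc.iloc[2]    # position-based lookup, same element

# Label slices include the end label: "b", "c" and "d".
label_slice = s_abc.loc["b":"d"]

# Position slices exclude the end position: positions 1 and 2 only.
position_slice = s_abc.iloc[1:3]

# A list of labels selects several elements at once.
picked = s_abc.loc[["a", "e"]]

print(by_label, by_position)
print(label_slice)
print(position_slice)
print(picked)
```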


<h3> Series Attributes and Methods </h3>
<p> Pandas Series come with various attributes and methods to help manipulate and analyze data effectively.</p>

<ul>
    <li>
        <b>values: </b>
        Return the Series data as a NumPy array</li>
    <li>
        <b>index: </b>
        Return the index (labels) of the Series
    </li>
    <li>
        <b>shape: </b>
        Return a tuple representing the dimensions of the Series
    </li>
    <li>
        <b>size: </b> Return the number of elements</li>
    <li><b>mean(), sum(), min(), max() </b> Calculate summary statistics of the data </li>
    <li><b>unique(), nunique() </b> Get unique values or the number of unique values</li>
    <li><b>sort_values(), sort_index() </b> Sort Series by values or index labels</li>
    <li><b>isnull(), notnull() </b> Check for missing (NaN) or non-missing values</li>
    <li><b>apply() </b> Apply a custom function to each element</li>

</ul>
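
<p> The attributes and methods listed above can be tried on a small Series. The values below, including the missing one, are made up for illustration. </p>

```python
import pandas as pd

# None is stored as NaN, so the Series has dtype float64.
s_demo = pd.Series([3.0, 1.0, None, 3.0])

print(s_demo.values)   # underlying NumPy array
print(s_demo.index)    # default integer index
print(s_demo.shape)    # (4,)
print(s_demo.size)     # 4

print(s_demo.mean())            # NaN is skipped by default: (3 + 1 + 3) / 3
print(s_demo.unique())          # distinct values, NaN included
print(s_demo.nunique())         # 2, NaN is not counted by default
print(s_demo.isnull().sum())    # 1 missing value
print(s_demo.sort_values())     # ascending, NaN placed last by default

# apply() calls the function once per element.
doubled = s_demo.apply(lambda x: x * 2)
print(doubled)
```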

<h3>What is a DataFrame?</h3>
<ul>
    <li> Two-dimensional labeled data structure whose columns may have different data types</li>
    <li> A table in which each column is a variable and each row is an observation or data point</li>
    <li> DataFrames suit a wide range of structured data, loaded from CSV files, Excel spreadsheets, SQL databases, and more </li>
</ul>

<h3>Creating DataFrames from Dictionaries:
</h3>
<p> When a DataFrame is built from a dictionary, the keys become column labels and each key's list of values fills that column, one value per row </p>

In [None]:
import pandas as pd

# Create a DataFrame from a dictionary

data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 35, 28],
        'City': ['New York', 'San Francisco', 'Los Angeles', 'Chicago']
        }

df = pd.DataFrame(data)

print(df)

      Name  Age           City
0    Alice   25       New York
1      Bob   30  San Francisco
2  Charlie   35    Los Angeles
3    David   28        Chicago


<h3> Column Selection:
</h3>

<p> Select a single column from df by putting the column name in square brackets; the result is a Series</p>
<p> Select multiple columns by passing a list of column names; the result is a DataFrame</p>


In [None]:
print(df['Name']) # Print the 'Name' column

0      Alice
1        Bob
2    Charlie
3      David
Name: Name, dtype: object


In [None]:
# Accessing rows
# You can access rows by integer position using .iloc[] or by index label using .loc[].

print(df.iloc[3]) # Access the fourth row (position 3)
print('=========')
print(df.loc[1]) # Access the row with index label 1

Name      David
Age          28
City    Chicago
Name: 3, dtype: object
=========
Name              Bob
Age                30
City    San Francisco
Name: 1, dtype: object


In [30]:
# Slicing
# Slice DataFrames to select specific rows and columns
print(df[['Name', 'Age']]) # Select specific columns
print("=========")
print(df[1:3]) # Select rows at positions 1 and 2 (the end, 3, is excluded)

      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35
3    David   28
=========
      Name  Age           City
1      Bob   30  San Francisco
2  Charlie   35    Los Angeles
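
<p> Rows and columns can also be selected in one step. The sketch below rebuilds the same table under the assumed name df_people and compares label-based .loc with position-based .iloc, then filters rows with a boolean condition. </p>

```python
import pandas as pd

df_people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [25, 30, 35, 28],
    "City": ["New York", "San Francisco", "Los Angeles", "Chicago"],
})

# .loc is label-based: the row slice 1:2 includes both end labels.
by_label = df_people.loc[1:2, ["Name", "City"]]

# .iloc is position-based: the row slice 1:3 excludes position 3.
by_position = df_people.iloc[1:3, [0, 2]]

# Boolean mask: keep only the rows where Age is greater than 28.
older = df_people[df_people["Age"] > 28]

print(by_label)
print(by_position)
print(older)
```

<p> Both selections return the rows for Bob and Charlie with the Name and City columns; only the way the slice end is interpreted differs. </p>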
