# Importing & Filtering Data

In this notebook, we will go over the common steps for importing and filtering data. Doing both properly is a foundational skill: once a data file is imported, we can sift through it to find the specific data that fits our use case.

### Import Basic Packages

In [23]:
#Basics
import numpy as np
import pandas as pd

### Import Data from CSV

In this scenario we'll import a dataset of students enrolled in a school and explore ways to select and filter data of interest.




In [30]:
# Import CSV data to a pandas dataframe
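
A minimal sketch of the import step. The file name of the student dataset is not given here, so the example reads CSV text from an in-memory buffer instead. The column names (`StudentID`, `FirstName`, `LastName`, `GradeAverage`) and the sample rows are assumptions made for illustration. With a real file, its path would be passed to `pd.read_csv` in place of the buffer.

```python
import io
import pandas as pd

# Hypothetical stand-in for the student CSV file (columns and rows are assumed)
csv_text = """StudentID,FirstName,LastName,GradeAverage
1,Ava,Smith,A
2,Ben,Jones,B
3,Cara,Smith,A
4,Dev,Patel,C
"""

# pd.read_csv accepts a file path or any file-like object
df_grades = pd.read_csv(io.StringIO(csv_text))

# Quick checks on what was imported
print(df_grades.head())
print(df_grades.shape)
```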


### Import Data from Excel File

**pd.read_excel function**

- By default, the first sheet of the Excel file is imported. The **sheet_name** argument specifies which sheet the data should come from.
- The **usecols** argument specifies which columns to import.

In addition, there are many other arguments that can be defined to specify how the file should be interpreted.

Documentation here: https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html

In [31]:
# Import data to a pandas dataframe


### Import Data Using SQL Query

Here we have simply defined an SQL query we want to use to retrieve data from our database.

In [26]:
sql_query = """SELECT 
 
    hist.FactID,
    hist.Date,
    hist.[Open],
    hist.High,
    hist.Low,
    hist.[Close],
    hist.AdjClose,
    hist.Volume,
 
    sec.Company,
    sec.Symbol,
    sec.Industry,
    sec.IndexWeighting,
 
    exc.Symbol AS Exchange
 
FROM [dbo].[FactPrices_Daily] AS hist
 
   INNER JOIN [dbo].[dimSecurity] AS sec 
      ON hist.SecurityID = sec.ID
      
   INNER JOIN [dbo].[dimExchange] AS exc 
      ON sec.ExchangeID = exc.ID
;"""

### Import Data from SQL Database Using pyodbc & sqlalchemy (Windows only syntax)

Depending on the SQL server type being used, and the drivers on your computer, the syntax below may be slightly different.

You may need to download ODBC Driver 18 from here: https://go.microsoft.com/fwlink/?linkid=2214634

In [27]:
import pyodbc
import os
import urllib.parse  # quote_plus lives in the urllib.parse submodule
from sqlalchemy import create_engine
# Some other example server values are
# server = 'localhost\sqlexpress' # for a named instance
# server = 'myserver,port' # to specify an alternate port
driver = '{ODBC Driver 18 for SQL Server}'
server = 'prod-sql-cfieducation.database.windows.net' 
database = 'StockPricesDW' 
username = 'ReportingUser'
password = 'CFICapitalPartners789#'

connection_string = f'DRIVER={driver};SERVER=tcp:{server};DATABASE={database};UID={username};PWD={password}'
odbc_params = urllib.parse.quote_plus(connection_string)
conn_string = f'mssql+pyodbc:///?odbc_connect={odbc_params}'
engine = create_engine(conn_string)

In [32]:
# Combine the sql query with the sql connection to get data from a database
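
A sketch of how a query and a connection combine into a dataframe with `pd.read_sql`. Because the SQL Server database above is not reachable from a standalone example, this uses an in-memory SQLite database as a stand-in. The table name `prices`, the query `demo_query`, and the sample rows are assumptions for illustration only.

```python
import sqlite3
import pandas as pd

# In-memory SQLite database standing in for the SQL Server connection
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (FactID INTEGER, Symbol TEXT, AdjClose REAL)")
conn.executemany(
    "INSERT INTO prices VALUES (?, ?, ?)",
    [(1, "AAA", 10.5), (2, "AAA", 10.8), (3, "BBB", 20.1)],
)
conn.commit()

# Hypothetical query against the stand-in table
demo_query = "SELECT FactID, Symbol, AdjClose FROM prices;"

# pd.read_sql runs the query over the connection and returns a dataframe
df_prices = pd.read_sql(demo_query, conn)
conn.close()

print(df_prices)
```

With the `engine` and `sql_query` defined above, the equivalent call would likely be `pd.read_sql(sql_query, engine)`, assuming the driver and credentials are valid on your machine.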


### Selecting Columns

We have several options to filter the data coming into our analysis.

Options:
- Option 1 (SQL): Select the required columns as part of our SQL SELECT statement.
- Option 2 (Python): Select the desired columns only from the dataframe in Python.
- Option 3 (Python): Drop the columns that are not required from the dataframe.

In [None]:
# Use Option 2 above to create a new dataframe
# Keep only the FactID and AdjClose price columns
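
A minimal sketch of Option 2. The small `df_prices` dataframe below is a hypothetical sample that reuses a few column names from the SQL query. The real data would come from the database.

```python
import pandas as pd

# Hypothetical sample mirroring a few columns from the SQL query
df_prices = pd.DataFrame({
    "FactID": [1, 2, 3],
    "Symbol": ["AAA", "AAA", "BBB"],
    "AdjClose": [10.5, 10.8, 20.1],
    "Volume": [1000, 1200, 900],
})

# Passing a list of column names returns a dataframe with only those columns;
# .copy() makes the result independent of the original dataframe
df_selected = df_prices[["FactID", "AdjClose"]].copy()

print(df_selected)
```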


In [33]:
# Use Option 3 above to create a new dataframe
# This time drop the FactID and AdjClose price columns
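
A minimal sketch of dropping columns instead of selecting them. The `df_prices` sample is hypothetical and only borrows column names from the SQL query.

```python
import pandas as pd

# Hypothetical sample mirroring a few columns from the SQL query
df_prices = pd.DataFrame({
    "FactID": [1, 2, 3],
    "Symbol": ["AAA", "AAA", "BBB"],
    "AdjClose": [10.5, 10.8, 20.1],
    "Volume": [1000, 1200, 900],
})

# drop(columns=...) returns a new dataframe without the listed columns;
# the original dataframe is left unchanged
df_dropped = df_prices.drop(columns=["FactID", "AdjClose"])

print(df_dropped)
```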


### Filter Columns

The `filter` function allows us to select columns based on criteria applied to the column names.

This is a very powerful way of filtering for columns, as we can systematically call for columns that meet a specific condition without having to check manually what kind of columns we have in our dataframe.

In [34]:
# Using the filter function with regex parameters to find a column that contains a specific word
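
A sketch of `DataFrame.filter` with the `regex` parameter. The student columns below are assumed for illustration. The pattern is matched against column names, not against the values in the columns.

```python
import pandas as pd

# Hypothetical student data (column names are assumed)
df_grades = pd.DataFrame({
    "StudentID": [1, 2, 3],
    "FirstName": ["Ava", "Ben", "Cara"],
    "LastName": ["Smith", "Jones", "Smith"],
    "GradeAverage": ["A", "B", "A"],
})

# Keep every column whose name contains the word "Name"
df_names = df_grades.filter(regex="Name")

print(df_names.columns.tolist())
```

For a plain substring match, `df_grades.filter(like="Name")` gives the same result here without a regular expression.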


### Filtering Rows

We can also filter for rows that contain a certain string in a column using the `str.contains` function.

In [35]:
# Using the str.contains function to find rows that contain a specific word in a column
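
A sketch of row filtering with `Series.str.contains`. The student data and the search term are assumptions for illustration. `case=False` makes the match case-insensitive, and `na=False` treats missing values as non-matches.

```python
import pandas as pd

# Hypothetical student data (column names are assumed)
df_grades = pd.DataFrame({
    "FirstName": ["Ava", "Ben", "Cara", "Dev"],
    "LastName": ["Smith", "Jones", "Smithson", "Patel"],
})

# Boolean mask: True where LastName contains "smith", ignoring case
mask = df_grades["LastName"].str.contains("smith", case=False, na=False)

# Indexing with the mask keeps only the matching rows
df_smith = df_grades[mask]

print(df_smith)
```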


We can also filter rows/columns based on logical conditions in the dataframe.

In [None]:
# Filter the rows of the original dataframe to include only rows where GradeAverage is A
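
A minimal sketch of filtering on a logical condition. The sample data is hypothetical. Only the `GradeAverage` column name comes from the comment above.

```python
import pandas as pd

# Hypothetical student data
df_grades = pd.DataFrame({
    "LastName": ["Smith", "Jones", "Smith", "Patel"],
    "GradeAverage": ["A", "B", "A", "C"],
})

# The comparison produces a boolean Series; indexing with it keeps the True rows
df_a_students = df_grades[df_grades["GradeAverage"] == "A"]

print(df_a_students)
```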


We can combine multiple conditions with the `&` operator, wrapping each condition in parentheses.
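
A sketch of combining two conditions, using hypothetical student data. Each condition is wrapped in parentheses because `&` binds more tightly than `==` in Python.

```python
import pandas as pd

# Hypothetical student data
df_grades = pd.DataFrame({
    "LastName": ["Smith", "Jones", "Smith", "Patel"],
    "GradeAverage": ["A", "A", "B", "C"],
})

# Both conditions must be True for a row to be kept
both = (df_grades["GradeAverage"] == "A") & (df_grades["LastName"] == "Smith")
df_filtered = df_grades[both]

print(df_filtered)
```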

### Drop Duplicates

Generally duplicates are unwanted, although sometimes there is a clear reason why they exist. If we know that duplicates are incorrect, we can deal with them.

Let's say that we want to get rid of all duplicate last names. We can first filter for any duplicate last names in our data to see if they exist.

In [None]:
# Identify any rows with duplicated last names
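
A sketch of finding duplicates with `duplicated`, assuming a `LastName` column in hypothetical student data. `keep=False` flags every row in a duplicated group, not just the later repeats.

```python
import pandas as pd

# Hypothetical student data
df_grades = pd.DataFrame({
    "FirstName": ["Ava", "Ben", "Cara", "Dev"],
    "LastName": ["Smith", "Jones", "Smith", "Patel"],
})

# True for every row whose LastName appears more than once
is_dup = df_grades.duplicated(subset="LastName", keep=False)

# Use the boolean mask to show only the duplicated rows
df_dups = df_grades[is_dup]

print(df_dups)
```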


By combining the `duplicated` function with a filter, we found the rows that have duplicated last names. We can now use the `drop_duplicates` function to remove them.

In [None]:
# Drop rows with duplicated last names

# Show how we check for dropped duplicate 
# df_grades, look for specific index that was dropped 
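
A sketch of removing the duplicates and then checking which index was dropped. The data is hypothetical, and keeping the first occurrence of each last name (`keep="first"`) is an assumption about the desired behavior.

```python
import pandas as pd

# Hypothetical student data
df_grades = pd.DataFrame({
    "FirstName": ["Ava", "Ben", "Cara", "Dev"],
    "LastName": ["Smith", "Jones", "Smith", "Patel"],
})

# Keep the first row for each LastName and drop later repeats
df_unique = df_grades.drop_duplicates(subset="LastName", keep="first")

# Index labels present in the original but missing after the drop
dropped_index = df_grades.index.difference(df_unique.index)

print(df_unique)
print(dropped_index.tolist())
```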

### Exercise 1 - Importing Our Data

Import the CSV file titled "phone_marketplace_dataset_cleaning_set.csv" and load it as a dataframe. This dataset contains information on used phone sales that took place on various marketplace platforms.

Task:
- Use the correct pandas function to import the csv file as a dataframe
- Assign the imported dataframe to a variable

In [1]:
# Import the phone_marketplace_dataset_cleaning_set.csv file into a dataframe


### Exercise 2 - Conditional Filtering

We realized that the CSV file may contain errors that we must deal with. Luckily, we were able to find out that the errors came from the craigslist data.

Task:
- Filter for data that have craigslist as a marketplace
- From the data that have craigslist as a marketplace, filter for only iPhone 11s

In [None]:
# Filter the data that has craigslist as marketplace


In [None]:
# Add an additional filter for iPhone 11 only.
