# Data Analyst with SQL

## SELECT

In [2]:
import pandas as pd

In [3]:
eurovision = pd.read_csv("Module1/eurovis.csv")

In [4]:
print(eurovision)

      ID  EventYear      Country  Gender GroupType  Place  Points HostCountry  \
0      1       2009    Lithuania    Male      Solo     23      23        Away   
1      2       2009       Israel  Female     Group     16      53        Away   
2      3       2009       France  Female      Solo      8     107        Away   
3      4       2009       Sweden  Female      Solo     21      33        Away   
4      5       2009      Croatia    Both     Group     18      45        Away   
..   ...        ...          ...     ...       ...    ...     ...         ...   
643  644       2010     Slovenia     NaN       NaN     16       6        Away   
644  645       2010       Sweden     NaN       NaN     11      62        Away   
645  646       2010  Switzerland     NaN       NaN     17       2        Away   
646  647       2010       Turkey     NaN       NaN      1     118        Away   
647  648       2010      Ukraine     NaN       NaN      7      77        Away   

    HostRegion  IsFinal  SF

SELECT the country column FROM the eurovision table.

In [None]:
SELECT Country FROM eurovision;

Amend your query to return the points column instead of the country column.

In [None]:
SELECT points FROM eurovision;

Use TOP to change the existing query so that only the first 50 rows are returned.

In [None]:
SELECT TOP(50) points FROM eurovision;

Return a list of unique countries using DISTINCT. Give the results an alias of unique_country.

In [None]:
SELECT DISTINCT country AS unique_country FROM eurovision;

SELECT the country and event_year columns from the eurovision table.

In [None]:
SELECT country, event_year FROM eurovision;

Use a shortcut to amend the current query, returning ALL columns in the table.

In [None]:
SELECT * FROM eurovision;

This time, return only half the rows using TOP, with the same shortcut as before to return all columns.

In [None]:
SELECT TOP(50) PERCENT * FROM eurovision;

## ORDER BY

In this exercise, you'll practice the use of ORDER BY using the grid dataset. It's loaded and waiting for you! It contains a subset of wider publicly available information on US power outages.

Some of the main columns include:

- description: The reason or cause of the outage.
- nerc_region: The regional entity in which the outage occurred. The North American Electric Reliability Corporation (NERC) was formed to ensure the reliability of the grid and comprises several regional entities.
- demand_loss_mw: How much energy was not transmitted or consumed during the outage.

- You can ORDER BY a column that does not appear in the SELECT list.
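The point about ordering by an unselected column can be checked with a small sketch. It assumes Python's built-in `sqlite3` module as a stand-in for SQL Server, and the three rows are made up for illustration; they are not taken from the real grid table.

```python
import sqlite3

# In-memory SQLite database with a tiny, made-up stand-in for the grid table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE grid (description TEXT, nerc_region TEXT, event_date TEXT)"
)
conn.executemany(
    "INSERT INTO grid VALUES (?, ?, ?)",
    [
        ("Severe Weather", "RFC", "2014-03-01"),
        ("Vandalism", "WECC", "2013-12-22"),
        ("Fuel Supply Emergency", "SERC", "2014-01-15"),
    ],
)

# event_date drives the sort but is not part of the SELECT list.
rows = conn.execute("SELECT description FROM grid ORDER BY event_date").fetchall()
descriptions = [row[0] for row in rows]
print(descriptions)

conn.close()
```

The rows come back in date order even though only description is returned.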

Select description, nerc_region and event_date from grid. Your query should return the first 20 rows, ordered by nerc_region, affected_customers and event_date, with event_date in descending order.




In [None]:
-- Select the top 20 rows from description, nerc_region and event_date
SELECT 
  TOP (20) description, nerc_region, event_date
FROM 
  grid 
  -- Order by nerc_region, affected_customers & event_date
  -- Event_date should be in descending order
ORDER BY
    nerc_region,
    affected_customers,
    event_date desc;

## WHERE

- Use single quotes when filtering on string values.

- Numeric values do not need quotes, but date values do.

- Write dates in the YYYY-MM-DD format (Year-Month-Day), which is the default in Microsoft SQL Server. SQL Server also accepts the unseparated YYYYMMDD form, such as '20131222'.
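A minimal sketch of these quoting rules follows. It assumes Python's built-in `sqlite3` module as a stand-in for SQL Server, stores dates as YYYY-MM-DD text (SQLite has no dedicated date type), and uses made-up rows rather than the real grid data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE grid (description TEXT, affected_customers INTEGER, event_date TEXT)"
)
# Made-up rows for illustration only.
conn.executemany(
    "INSERT INTO grid VALUES (?, ?, ?)",
    [
        ("Vandalism", 0, "2013-12-22"),
        ("Severe Weather", 600000, "2014-03-01"),
        ("Severe Weather", 120000, "2013-12-22"),
    ],
)

# String filter: the value is wrapped in single quotes.
by_text = conn.execute(
    "SELECT COUNT(*) FROM grid WHERE description = 'Vandalism'"
).fetchone()[0]

# Numeric filter: no quotes around the number.
by_number = conn.execute(
    "SELECT COUNT(*) FROM grid WHERE affected_customers >= 500000"
).fetchone()[0]

# Date filter: the date is quoted, written as YYYY-MM-DD.
by_date = conn.execute(
    "SELECT COUNT(*) FROM grid WHERE event_date = '2013-12-22'"
).fetchone()[0]

print(by_text, by_number, by_date)
conn.close()
```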

In [None]:
-- Select description and event_year
SELECT 
  description, 
  event_year
FROM 
  grid 
  -- Filter the results
WHERE
  description = 'Vandalism';

Select the nerc_region and demand_loss_mw columns, limiting the results to those where affected_customers is greater than or equal to 500000 (500,000)

In [None]:
-- Select nerc_region and demand_loss_mw
SELECT 
  nerc_region, 
  demand_loss_mw 
FROM 
  grid 
-- Retrieve rows where affected_customers is >= 500000  (500,000)
WHERE
  affected_customers >= 500000;

Select the description and affected_customers columns, returning only the records where the event_date was the 22nd December, 2013.


In [None]:
-- Select description and affected_customers
SELECT 
  description, 
  affected_customers
FROM 
  grid 
  -- Retrieve rows where the event_date was the 22nd December, 2013    
WHERE 
  event_date = '20131222';

Limit the results to those where the affected_customers is BETWEEN 50000 and 150000, and order in descending order of event_date.




In [None]:
-- Select description, affected_customers and event date
SELECT 
  description, 
  affected_customers,
  event_date
FROM 
  grid 
  -- The affected_customers column should be >= 50000 and <=150000   
WHERE 
  affected_customers BETWEEN 50000
  AND 150000
   -- Define the order   
ORDER BY
  event_date desc;
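As the comment in the query above notes, BETWEEN includes both endpoints. The sketch below checks this against the equivalent `>=` and `<=` filter. It assumes Python's built-in `sqlite3` module as a stand-in for SQL Server, and the values are made up to sit on and around the boundaries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE grid (affected_customers INTEGER)")
# Made-up values chosen to sit just outside, on, and inside the boundaries.
conn.executemany(
    "INSERT INTO grid VALUES (?)",
    [(49999,), (50000,), (100000,), (150000,), (150001,)],
)

# BETWEEN keeps both 50000 and 150000.
between = [
    row[0]
    for row in conn.execute(
        """
        SELECT affected_customers FROM grid
        WHERE affected_customers BETWEEN 50000 AND 150000
        ORDER BY affected_customers
        """
    )
]

# The same filter written with explicit comparison operators.
explicit = [
    row[0]
    for row in conn.execute(
        """
        SELECT affected_customers FROM grid
        WHERE affected_customers >= 50000 AND affected_customers <= 150000
        ORDER BY affected_customers
        """
    )
]

print(between)
conn.close()
```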


## Working with NULL values

A NULL value could mean 'zero': if something doesn't happen, it can't be logged in a table. However, NULL can also mean 'unknown' or 'missing', so consider whether it is appropriate to replace NULL values in your results. NULL values also provide feedback on data quality: if you have NULL values and you didn't expect any, then you have an issue with either how the data is captured or how it is entered in the database.

In this exercise, you'll practice filtering for NULL values, excluding them from results, and replacing them with alternative values.
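The three operations can be sketched as follows, assuming Python's built-in `sqlite3` module as a stand-in for SQL Server and a few made-up rows. The replacement uses `COALESCE`, which both engines support. Substituting 0 is an assumption for illustration, appropriate only when NULL really does mean zero.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE grid (description TEXT, demand_loss_mw INTEGER)")
# Made-up rows; None is stored as NULL.
conn.executemany(
    "INSERT INTO grid VALUES (?, ?)",
    [
        ("Vandalism", None),
        ("Severe Weather", 300),
        ("Fuel Supply Emergency", None),
    ],
)

def count(where_clause):
    """Count rows of grid matching a hard-coded WHERE clause."""
    return conn.execute(f"SELECT COUNT(*) FROM grid WHERE {where_clause}").fetchone()[0]

# Comparing to NULL with = never matches, so IS NULL is required.
equals_null = count("demand_loss_mw = NULL")
is_null = count("demand_loss_mw IS NULL")
is_not_null = count("demand_loss_mw IS NOT NULL")

# Replace NULL with an alternative value in the results only.
replaced = conn.execute(
    "SELECT description, COALESCE(demand_loss_mw, 0) FROM grid ORDER BY description"
).fetchall()

print(equals_null, is_null, is_not_null)
print(replaced)
conn.close()
```

The stored data is unchanged; `COALESCE` only affects what the query returns.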

Use a shortcut to select all columns from grid. Then filter the results to only include rows where demand_loss_mw is unknown or missing.

In [None]:
-- Retrieve all columns
SELECT 
  * 
FROM 
  grid 
  -- Return only rows where demand_loss_mw is missing or unknown  
WHERE 
  demand_loss_mw IS NULL;

Adapt your code to return rows where demand_loss_mw is not unknown or missing.

In [None]:
-- Retrieve all columns
SELECT 
  * 
FROM 
  grid 
  -- Return rows where demand_loss_mw is not missing or unknown   
WHERE 
  demand_loss_mw IS NOT NULL;

### Exploring classic rock songs

It's time to rock and roll! In this set of exercises, you'll use the songlist table, which contains songs featured on the playlists of 25 classic rock radio stations.

First, let's get familiar with the data.

Retrieve the song, artist, and release_year columns from the songlist table

Make sure there are no NULL values in the release_year column.

Order the results by artist and release_year.

In [None]:
-- Retrieve the song,artist and release_year columns
SELECT 
  song, 
  artist, 
  release_year 
FROM 
  songlist 
  -- Ensure there are no missing or unknown values in the release_year column
WHERE 
  release_year IS NOT NULL 
  -- Arrange the results by the artist and release_year columns
ORDER BY  
  artist, 
  release_year;

Extend the WHERE clause so that the results are those with a release_year greater than or equal to 1980 and less than or equal to 1990.



In [None]:
SELECT 
  song, 
  artist, 
  release_year
FROM 
  songlist 
WHERE 
  -- Retrieve records greater than and including 1980
  release_year >= 1980
  -- Also retrieve records up to and including 1990
  AND release_year <= 1990
ORDER BY 
  artist, 
  release_year;

## Using parentheses in your queries

You can use parentheses to make the intention of your code clearer. This becomes very important when using AND and OR clauses, to ensure your queries return the exact subsets you need.
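To see why the grouping matters, the sketch below runs the same conditions with the parentheses in two different places. It assumes Python's built-in `sqlite3` module as a stand-in for SQL Server, and the artists and songs are placeholders, not rows from the real songlist table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE songlist (song TEXT, artist TEXT, release_year INTEGER)")
# Placeholder rows for illustration only.
conn.executemany(
    "INSERT INTO songlist VALUES (?, ?, ?)",
    [
        ("Song 1", "Bravo Band", 1986),
        ("Song 2", "Bravo Band", 1995),
        ("Song 3", "Alpha Band", 1995),
        ("Song 4", "Alpha Band", 1986),
    ],
)

def songs(where_clause):
    """Return song titles matching a hard-coded WHERE clause, sorted by title."""
    query = f"SELECT song FROM songlist WHERE {where_clause} ORDER BY song"
    return [row[0] for row in conn.execute(query)]

# B-artists from 1986, plus anything after 1990 by any artist.
and_grouped = songs("(artist LIKE 'B%' AND release_year = 1986) OR release_year > 1990")

# Only B-artists, from either 1986 or after 1990.
or_grouped = songs("artist LIKE 'B%' AND (release_year = 1986 OR release_year > 1990)")

# Without parentheses, AND is evaluated before OR.
ungrouped = songs("artist LIKE 'B%' AND release_year = 1986 OR release_year > 1990")

print(and_grouped, or_grouped, ungrouped)
conn.close()
```

With no parentheses the result matches the first grouping, because AND binds more tightly than OR. Writing the parentheses anyway makes that intent explicit.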

Select all artists beginning with B who released tracks in 1986, and also retrieve any records where the release_year is greater than 1990.

In [None]:
SELECT 
  artist, 
  release_year, 
  song 
FROM 
  songlist 
  -- Choose the correct artist and specify the release year
WHERE 
  (
    artist LIKE 'B%' 
    AND release_year = 1986
  ) 
  -- Or return all songs released after 1990
  OR release_year > 1990
  -- Order the results
ORDER BY 
  release_year, 
  artist, 
  song;