# Data analysis
### Gender Swap
Our project focuses on swapping the gender of a person in a photo. To do this we need a dataset containing pictures of both males and females. We chose CelebA, which consists of more than 202k photos of celebrities annotated with 40 binary attributes, of which only one matters to us: gender.

In [7]:
import numpy as np
import cv2
import os
# Raw string so the backslashes in the Windows path are not treated as escape sequences
face_cascade = cv2.CascadeClassifier(r"D:\opencv\opencv\data\haarcascades\haarcascade_frontalface_default.xml")
entries = os.listdir('D:/Skola/4.roc/NSIETE/dataset/img_celeba/')
print(len(entries))

202599


![raw_images](img/raw.png)

The pictures are not uniform: their dimensions differ and the face is almost never in the same position. To fix this, we run face detection and extract the faces, resize them so that all images have the same dimensions, and finally split them into male and female directories.

In [None]:
for e in entries:
    img = cv2.imread('D:/Skola/4.roc/NSIETE/dataset/img_celeba/' + e)
    if img is None:  # skip files that OpenCV cannot read
        continue
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # scaleFactor=1.3, minNeighbors=5
    faces = face_cascade.detectMultiScale(gray, 1.3, 5)
    # Keep only the first detected face, resized to 256x256
    for (x, y, w, h) in faces[:1]:
        roi_color = img[y:y+h, x:x+w]
        resized = cv2.resize(roi_color, (256, 256))
        cv2.imwrite('D:/Skola/4.roc/NSIETE/dataset/onlyface/' + e, resized)
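
The loop above keeps whichever detection comes first in the result of `detectMultiScale`. When a photo contains several people, one possible refinement is to keep the largest box instead. This is a minimal sketch, assuming `faces` is a sequence of `(x, y, w, h)` boxes as in the loop above; `largest_face` is a hypothetical helper, not part of OpenCV or of the original notebook, and the boxes are made up.

```python
def largest_face(faces):
    """Return the (x, y, w, h) box with the largest area, or None if there are none."""
    best = None
    for (x, y, w, h) in faces:
        if best is None or w * h > best[2] * best[3]:
            best = (x, y, w, h)
    return best

# Made-up boxes for illustration only
boxes = [(10, 10, 40, 40), (100, 50, 120, 120), (5, 5, 20, 20)]
print(largest_face(boxes))
```

Whether this improves the dataset depends on how often CelebA photos contain more than one detectable face, which was not measured here.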

In [8]:
onlyfaces = os.listdir('D:/Skola/4.roc/NSIETE/dataset/onlyface/')
print(len(onlyfaces))

169410


![only_faces](img/onlyfaces.png)

Faces were not detected in every picture (some photos show people who are not facing the camera), so the dataset now contains about 169 thousand pictures, roughly 84% of the original 202,599. The next step is to sort the faces into male and female.

In [None]:
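from shutil import copyfile

# Sort so the filenames follow the same order as the rows of the attribute file
entries = sorted(os.listdir('D:/Skola/4.roc/NSIETE/dataset/onlyface/'))
with open('celeba-atr.txt', 'r') as f:
    lines = f.readlines()
i = 0
for e in entries:
    line = lines[i].split()
    # Advance to the row for this image (rows of images without a detected face are skipped)
    while line[0] != e:
        i += 1
        line = lines[i].split()
    # Column 21 holds the gender flag: '1' is male, anything else is female
    if line[21] == '1':
        copyfile('D:/Skola/4.roc/NSIETE/dataset/onlyface/'+e,'D:/Skola/4.roc/NSIETE/dataset/male/'+e)
    else:
        copyfile('D:/Skola/4.roc/NSIETE/dataset/onlyface/'+e,'D:/Skola/4.roc/NSIETE/dataset/female/'+e)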
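
The sorting cell relies on column 21 of each row being the gender flag. An alternative is to look the column up by name in the header row and build a filename lookup, which also removes the dependence on file order. The sketch below assumes the annotation file follows the standard CelebA layout (a count line, a header line of attribute names, then one row per image with values `1` or `-1`). The sample text is a made-up, truncated excerpt with only three attributes, and `parse_attribute` is a hypothetical helper.

```python
def parse_attribute(text, name):
    """Map each image filename to True/False for the attribute `name`."""
    lines = text.strip().splitlines()
    header = lines[1].split()       # attribute names; no filename column here
    col = header.index(name) + 1    # +1 because data rows start with the filename
    result = {}
    for row in lines[2:]:
        parts = row.split()
        result[parts[0]] = parts[col] == '1'
    return result

# Made-up excerpt: count line, header line, then one row per image
sample = """3
Eyeglasses Male Smiling
000001.jpg -1 -1  1
000002.jpg -1  1 -1
000003.jpg  1  1  1
"""
is_male = parse_attribute(sample, 'Male')
print(is_male)
```

With such a dictionary, the destination directory for a file `e` could be chosen with `is_male[e]` instead of scanning the rows in order.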

In [10]:
males = os.listdir('D:/Skola/4.roc/NSIETE/dataset/male/')
females = os.listdir('D:/Skola/4.roc/NSIETE/dataset/female/')
print('Males:',len(males),'\nFemales:',len(females))

Males: 68560 
Females: 100850


We have more pictures of females than males (roughly a 60/40 split), but this is not a problem for our project. The number of pictures also seems sufficient.
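
As a quick check on the counts printed above, the class shares can be computed directly. The variable names `n_males` and `n_females` are chosen for this sketch and do not appear in the notebook.

```python
# Counts taken from the output of the previous cell
n_males, n_females = 68560, 100850
total = n_males + n_females

print(total)  # 169410, the same as the number of extracted faces
print(f"male share:   {n_males / total:.1%}")    # 40.5%
print(f"female share: {n_females / total:.1%}")  # 59.5%
```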