There are plenty of controllers that let you play video games by moving your body or individual parts of it. Yet there is no application that replaces these controllers with an ordinary webcam, something almost every PC user already has.
Given the interest in such projects from both players and developers, implementing this idea seemed very appealing to me. Let's get straight to the breakdown.
The entire cycle of implementing camera-based game control can be divided into three stages:
- collecting user data;
- training ML models on the collected data;
- assigning keys in the keyboard/mouse emulator and testing the trained models.

We will repeat this cycle for two games: the classic Mario platformer and the boxing simulator Punch a Bunch. We will control the first with our palms and the second with our whole body.
First, let's define the poses we want to recognize in order to trigger particular commands (more on the commands in paragraph 5). For Punch a Bunch we define four poses: basic stance, block, right hook and left hook. For Mario we define five: rest and jump for the right hand; rest, move forward and move backward for the left hand.
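As a rough sketch, the label sets might look like this (the names are illustrative, not the exact strings from the original scripts):

```python
# Hypothetical pose labels for the two games.
PUNCH_A_BUNCH_POSES = ["stand", "block", "right_hook", "left_hook"]

MARIO_POSES = {
    "right_hand": ["rest", "jump"],
    "left_hand": ["rest", "move_forward", "move_backward"],
}
```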
Now let's move on to the data collection itself. We iterate over the list of poses, collect data for each one for one minute, and record the results in a .csv file. Between poses, the user gets 5 seconds to switch to the next one.
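A minimal sketch of such a collection loop is shown below. It assumes a hypothetical `extract_landmarks()` helper that turns a webcam frame into a flat list of landmark coordinates (for example, via a pose/hand tracking library); the labels and file name are illustrative:

```python
import csv
import time

import cv2

# extract_landmarks() is a hypothetical helper, not part of the original
# scripts: it should return a flat list of landmark coordinates per frame.
from landmarks import extract_landmarks

POSES = ["stand", "block", "right_hook", "left_hook"]  # illustrative labels
SECONDS_PER_POSE = 60
PAUSE_BETWEEN_POSES = 5
OUTPUT_PATH = "poses.csv"

cap = cv2.VideoCapture(0)
with open(OUTPUT_PATH, "w", newline="") as f:
    writer = csv.writer(f)
    for pose in POSES:
        time.sleep(PAUSE_BETWEEN_POSES)  # give the user time to change pose
        end_time = time.time() + SECONDS_PER_POSE
        while time.time() < end_time:
            ok, frame = cap.read()
            if not ok:
                continue
            landmarks = extract_landmarks(frame)
            if landmarks is not None:
                writer.writerow([pose, *landmarks])  # label + features
cap.release()
```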
For convenience, I implemented a simple interface: it shows the time remaining for the current pose, the name of that pose, and the path where the data is being recorded.
For classification I use gradient boosting and random forest models (both are available in scikit-learn, for example). Linear models and fully connected neural networks could also be used.
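As a sketch, training a random forest on the collected CSV could look like this, assuming the file layout from the collection step (label in the first column, landmark coordinates in the rest):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# File name and CSV layout follow the collection sketch above.
data = pd.read_csv("poses.csv", header=None)
y = data.iloc[:, 0]   # pose label
X = data.iloc[:, 1:]  # landmark coordinates

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```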
To interact with the game, you first need to assign to each pose the set of actions that should be performed while that pose is held. To do this, I created a dictionary in the pose → request format. A request consists of nested lists, each corresponding to a single action. One nested action has three elements: the device we are emulating (keyboard or mouse), the action we want to perform, and the parameter of that action (in our case, the button we want to hold or release). All of this can be seen in the script.
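An illustrative bind dictionary in this pose → request format might look as follows; the device names, actions and keys are assumptions, not the exact values from the original script:

```python
# pose -> list of [device, action, parameter] actions
BINDS = {
    "stand":      [["keyboard", "release", "s"]],
    "block":      [["keyboard", "hold", "s"]],
    "right_hook": [["mouse", "hold", "left"]],
    "left_hook":  [["mouse", "hold", "right"]],
}
```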
I also wrote a function to decode these requests; it consists almost entirely of if-else constructs (it is also in the script).
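One possible shape of such a decoder is sketched below. The original project uses its own keyboard/mouse emulator, so treat the use of pynput here as an assumption:

```python
from pynput.keyboard import Controller as KeyboardController
from pynput.mouse import Button, Controller as MouseController

keyboard = KeyboardController()
mouse = MouseController()

def execute_request(request):
    """Run every [device, action, key] triple from a bind request."""
    for device, action, key in request:
        if device == "keyboard":
            if action == "hold":
                keyboard.press(key)
            elif action == "release":
                keyboard.release(key)
        elif device == "mouse":
            button = Button.left if key == "left" else Button.right
            if action == "hold":
                mouse.press(button)
            elif action == "release":
                mouse.release(button)
```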
Next, we take the existing code for obtaining and outputting labels and plug in the trained model, the dictionary of binds, and the request-decoding function. Everything is ready!
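Putting the pieces together, a minimal inference loop could look like this; it reuses the hypothetical `extract_landmarks()` helper, the trained `model`, the `BINDS` dictionary and `execute_request()` from the sketches above:

```python
import cv2

cap = cv2.VideoCapture(0)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    landmarks = extract_landmarks(frame)
    if landmarks is not None:
        pose = model.predict([landmarks])[0]      # classify the current pose
        execute_request(BINDS.get(pose, []))      # emulate the bound actions
    cv2.imshow("preview", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):         # press q to quit
        break
cap.release()
cv2.destroyAllWindows()
```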
Here is an example of the system in action:
In my opinion, this project has great potential for further development, and I plan to keep working on it. If you find it interesting and would like to join in, I would be glad to have your support.